diff --git a/README.md b/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7fed49aabbd4550dc8761855ef450ab0d3191b0f
--- /dev/null
+++ b/README.md
@@ -0,0 +1,211 @@
+---
+license: apache-2.0
+language:
+- ar
+- en
+pipeline_tag: text-generation
+tags:
+ - pytorch
+library_name: transformers
+---
+# ALLaM-7B-Instruct-preview
+
+ALLaM is a series of powerful language models designed to advance Arabic Language Technology (ALT), developed by the National Center for Artificial Intelligence (NCAI) at the [Saudi Data and AI Authority (SDAIA)](https://sdaia.gov.sa/en/default.aspx). `ALLaM-AI/ALLaM-7B-Instruct-preview` is trained from scratch. Our from-scratch pretraining recipe consists of two steps: training on 4T English tokens, followed by training on 1.2T mixed Arabic/English tokens. This preserves the model's English capabilities without catastrophic forgetting, effectively transferring knowledge from one language distribution to another.
+
+## Intended Use
+
+`ALLaM` is specifically designed to expedite the research and development of ALT through Large Language Models (LLMs). It serves as one of the foundational elements for building product offerings as well as for facilitating experimental initiatives.
+
+The ALLaM series models are designed to be components of a larger AI system, and it is important for developers to incorporate safety measures when creating such systems. These safety measures are crucial for striking a balance between effectiveness and security, and for minimizing potential risks, such as those arising from integrating the model with external tools.
+
+## Model Details
+
+ALLaM is a family of LLMs specially trained for Arabic. The two main pretraining paths are:
+
+- **ALLaM**: Pretraining models from scratch
+- **ALLaM-Adapted / ALLaM-(\*\*) / (\*\*)-ALLaM**: Continued training from open-source/open-weight models
+
+For this release, we are providing our instruction-tuned 7B parameter generative model pretrained from scratch.
+
+Some parameters for this model are provided in the following table:
+
+| Size | Context Length | Pretraining Tokens | Instructions | Preference Pairs |
+|----------------|-----------------|--------------------|--------------|------------------|
+| 7B parameters | 4096 tokens |4T(en) + 1.2T(en+ar)| 7M | 260K |
+
+
+## Model Description
+
+- **Developed by:** National Center for Artificial Intelligence at [SDAIA](https://sdaia.gov.sa/en/default.aspx)
+- **Model type:** Autoregressive Transformer
+- **Language(s):** Arabic, English
+- **License:** Please see the LICENSE file
+- **Input:** Text
+- **Output:** Text
+
+
+## Training Details
+
+ALLaM-7B-Instruct-preview is pretrained on a total of 5.2 trillion tokens in English and Arabic. Our training codebase is built on [NVIDIA/Megatron-LM](https://github.com/NVIDIA/Megatron-LM). The average MFU during training was ~42%, and we trained the model using bf16 mixed precision.
+
+
+## Getting started
+
+
+### System Prompt
+
+It is important to note that this model is optimized to function without a predefined system prompt.
+While ALLaM does not come with a default system prompt, it provides the flexibility to add a custom one.
+For instance, a well-crafted system prompt could be:
+
+"You are ALLaM, a bilingual English and Arabic AI assistant."
+
+System prompts can also be in Arabic:
+
+"أنت علام، مساعد ذكاء اصطناعي مطور من الهيئة السعودية للبيانات والذكاء الاصطناعي، تجيب على الأسئلة بطريقة مفيدة مع مراعاة القيم الثقافية المحلية."
+
+(Roughly: "You are ALLaM, an AI assistant developed by the Saudi Data and AI Authority. You answer questions helpfully while taking local cultural values into account.")
+
+Alternatively, users can get creative with their prompts, for example:
+
+"You are an AI assistant who responds to everything like a pirate."
+
+The system prompt is handled through the chat template stored in the tokenizer config (applied via the `apply_chat_template()` method).
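+
+As a minimal sketch (assuming the standard `transformers` chat-template interface), a custom system prompt can be supplied as the first message before applying the template:
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("ALLaM-AI/ALLaM-7B-Instruct-preview")
+
+messages = [
+    # Optional custom system prompt; the model also works without one.
+    {"role": "system", "content": "You are ALLaM, a bilingual English and Arabic AI assistant."},
+    {"role": "user", "content": "What is the capital of Saudi Arabia?"},
+]
+
+# The chat template stored in the tokenizer config renders the system and user turns.
+prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+print(prompt)
+```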
+
+
+### Example Usages
+
+The weights for ALLaM model checkpoints can be accessed via [HuggingFace transformers](https://github.com/huggingface/transformers) (tested with `transformers>=4.40.1`). The following snippet demonstrates how to load `ALLaM-AI/ALLaM-7B-Instruct-preview` and generate text with it.
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Load the instruction-tuned checkpoint and its tokenizer (which carries the chat template).
+allam_model = AutoModelForCausalLM.from_pretrained("ALLaM-AI/ALLaM-7B-Instruct-preview")
+tokenizer = AutoTokenizer.from_pretrained("ALLaM-AI/ALLaM-7B-Instruct-preview")
+
+messages = [
+    {"role": "user", "content": "كيف أجهز كوب شاهي؟"},
+]
+
+# Render the conversation with the chat template, then tokenize it.
+inputs = tokenizer.apply_chat_template(messages, tokenize=False)
+inputs = tokenizer(inputs, return_tensors="pt", return_token_type_ids=False)
+
+# Move the model and inputs to the GPU and sample a response.
+inputs = {k: v.to("cuda") for k, v in inputs.items()}
+allam_model = allam_model.to("cuda")
+response = allam_model.generate(**inputs, max_new_tokens=4096, do_sample=True, top_k=50, top_p=0.95, temperature=0.6)
+print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
+```
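+
+If GPU memory is a concern, the following is a variant of the example above (assuming `torch` and `accelerate` are installed so that `device_map="auto"` is available) that loads the weights in bfloat16, the precision the model was trained in:
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "ALLaM-AI/ALLaM-7B-Instruct-preview"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# Load the checkpoint in bfloat16 and let transformers place it on the available device(s).
+allam_model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+
+messages = [{"role": "user", "content": "How do I prepare a cup of tea?"}]
+input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(allam_model.device)
+response = allam_model.generate(input_ids, max_new_tokens=512, do_sample=True, top_p=0.95, temperature=0.6)
+print(tokenizer.decode(response[0], skip_special_tokens=True))
+```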
+
+
+## Ethical Considerations and Limitations
+
+ALLaM is a generative model that comes with inherent uncertainties. Trials cannot encompass every possible use case, so ALLaM's responses cannot be predicted in every context, and its outputs may on occasion be incorrect or biased. Developers must conduct thorough safety evaluations and make adjustments specific to their use case to ensure the model is suitable for the intended purposes.
+
+*The output generated by this model is not considered a statement of NCAI, SDAIA, or any other organization.*
+
+## Evaluation
+
+### Automatic Benchmarks
+
+#### Arabic Benchmarks
+**Massive Multitask Language Understanding** (MMLU) is a large collection of multiple-choice evaluation questions sourced from various academic levels (elementary to college). The questions typically cover the humanities, STEM, or the social sciences. MMLU was originally an English dataset, but Arabic variants have since been developed:
+
+
+- [Arabic MMLU](https://huggingface.co/datasets/MBZUAI/ArabicMMLU): A collection of 14,575 original Arabic questions spanning 40 domains, published by MBZUAI.
+- [OpenAI MMLU-ar](https://huggingface.co/datasets/openai/MMMLU): A dataset comprising 14,042 questions, translated from the original MMLU benchmark published by OpenAI.
+
+**Exams Arabic** ([Exams (Ar)](https://github.com/FreedomIntelligence/Arabic-eval/blob/main/LLM/benchmark_eval/benchmarks/EXAMS_Arabic/exam_test.jsonl)): A multiple-choice question dataset with 537 samples, covering several domains, e.g., Islamic studies, science, humanities, and physics.
+
+**Arabic Cultural Alignment** ([ACVA](https://huggingface.co/datasets/FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment)): This dataset was generated by `gpt-3.5-turbo` and contains 8,710 true/false questions from 58 different areas.
+
+**Education and Training Evaluation Commission** (ETEC): This dataset consists of Arabic-language multiple-choice questions, compiled by the ALLaM team in collaboration with [Saudi ETEC](https://acpd.etec.gov.sa/Home/index?csrt=5175167507218838843). It spans various educational levels, from elementary through post-college, with a total of 1,887 test samples.
+
+**IEN**: This dataset was curated from the Ministry of Education's (MOE) [IEN platform](https://www.ientv.edu.sa/ar) and organized by grade, topic, and difficulty level. It comprehensively covers the entire Saudi curriculum from 1st grade through high school, with 9,990 multiple-choice questions and 5,823 true/false questions.
+
+**GAT**: The General Aptitude Test (GAT) dataset consists of approximately 16,000 Arabic multiple-choice questions, representing various sections of [the Qiyas General Aptitude Test](https://www.etec.gov.sa/en/service/Generalabilitytest/servicegoal). The sections include algebra, reading comprehension, analogies, arithmetic, associations, comparisons, completions, contextual understanding, and geometry.
+
+**AraPro**: A curated collection of 5,001 multiple-choice questions (MCQs) authored by our domain experts. The dataset spans various subjects, including mathematics, science, and other relevant fields, providing a diverse set of questions for evaluation purposes.
+
+**AraMath**: AraMath consists of 605 MCQs derived from [ArMath](https://github.com/reem-codes/ArMATH), a collection of mathematical word problems that were transformed into MCQs internally.
+
+**Ar-IFEval**: An Arabic instruction-following (IF) evaluation dataset designed to automatically assess language models' compliance with specified instructions through verifiable methods. The dataset consists of 535 instances, each containing two to four verifiable instructions that can be validated using deterministic programming approaches.
+
+All models were evaluated using our proprietary evaluation pipeline and [LM Evaluation Harness framework](https://github.com/EleutherAI/lm-evaluation-harness) to ensure fair comparisons. For API-based models, we used exact match evaluations of the generated outputs.
+
+The evaluation scores of ALLaM can be found in JSON format [here](https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview/tree/main/evaluation).
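+
+As an illustrative sketch only (the exact task names and few-shot settings follow the JSON files above, and our internal benchmarks additionally require custom task configs), a public task such as Arabic MMLU can be run with the harness's Python API:
+
+```python
+import lm_eval
+
+# Evaluate the released checkpoint on the public Arabic MMLU task (0-shot, as reported below).
+results = lm_eval.simple_evaluate(
+    model="hf",
+    model_args="pretrained=ALLaM-AI/ALLaM-7B-Instruct-preview,trust_remote_code=True",
+    tasks=["arabicmmlu"],
+    num_fewshot=0,
+    batch_size="auto",
+)
+print(results["results"])
+```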
+
+
+| Model | AVG | ETEC<br>0 shot | IEN-MCQ<br>0 shot | IEN-TF<br>0 shot | AraPro<br>0 shot | AraMath<br>5 shot | Ar-IFEval<br>(prompt strict)<br>0 shot | Ar-IFEval<br>(inst strict)<br>0 shot | ExamsAR<br>5 shot | ACVA<br>5 shot | Arabic MMLU<br>0 shot | OpenAI MMLU<br>0 shot | GAT<br>0 shot |
+|:----------------------------|:----------|:---------|:-----------------|:----------------|:----------------|:-----------------|:-----------------------------------|:---------------------------------|:------------------|:--------------|:--------------------|:--------------------|:-----------------------------|
+| ALLaM-7B-Instruct-preview | 64.42 | 66.67 | **91.77** | 82.95 | 69.71 | 66.78 | 31.34 | 67.65 | 51.58 | 76.33 | 67.78 | 55.91 | 44.53 |
+| AceGPT-v2-8B-Chat | 52.67 | 56.81 | 77.01 | 75.91 | 63.51 | 41.49 | 10.26 | 39.25 | 51.96 | 72.69 | 57.02 | 49.99 | 36.15 |
+| AceGPT-v2-32B-Chat | 62.23 | 64.81 | 81.6 | 80.35 | 67.19 | 64.46 | 25.75 | 63.41 | 55.31 | 71.57 | 68.3 | 60.8 | 43.21 |
+| jais-family-6p7b-chat | 46.31 | 45.47 | 46.22 | 63.92 | 54.31 | 25.29 | 13.99 | 52.97 | 46.93 | 73.8 | 56.15 | 44.96 | 31.71 |
+| jais-family-13b-chat | 49.14 | 48.65 | 62.95 | 68.68 | 57.53 | 26.61 | 17.16 | 54.27 | 45.07 | 71.18 | 58.14 | 47.73 | 31.72 |
+| jais-family-30b-16k-chat | 52.54 | 53.31 | 74.88 | 68.76 | 62.79 | 41.49 | 16.6 | 54.95 | 49.72 | 60.08 | 62.04 | 50.98 | 34.85 |
+| jais-family-30b-8k-chat | 53.19 | 53.52 | 72.76 | 70.65 | 61.27 | 33.39 | 16.79 | 54.68 | 50.28 | 74.47 | 63.11 | 50.9 | 36.44 |
+| jais-adapted-7b-chat | 45.19 | 40.49 | 57.38 | 67.18 | 50.59 | 28.43 | 14.93 | 54.27 | 40.6 | 70.44 | 49.75 | 38.54 | 29.68 |
+| jais-adapted-13b-chat | 51.86 | 48.12 | 69.65 | 71.85 | 59.07 | 37.02 | 23.32 | 60.61 | 48.23 | 67.78 | 56.42 | 46.83 | 33.4 |
+| jais-adapted-70b-chat | 58.32 | 56.81 | 74.51 | 76.47 | 64.59 | 45.62 | 27.05 | 65.05 | 54.75 | 73.33 | 65.74 | 56.82 | 39.15 |
+| Qwen2.5-7B-Instruct | 60.55 | 64.12 | 66.38 | 78.46 | 64.63 | 71.74 | 28.17 | 65.19 | 50.65 | 78.17 | 61.54 | 56.1 | 41.42 |
+| Qwen2.5-14B-Instruct | 71.26 | 72.18 | 80.51 | 77.64 | 69.11 | 82.81 | 68.66 | 86.76 | 57.54 | 75.04 | 69.36 | 63.8 | 51.7 |
+| Qwen2.5-72B-Instruct | **76.91** | **78.7** | 86.88 | **86.62** | **74.69** | **92.89** | 67.72 | 87.51 | 60.71 | **79.92** | **74.1** | **73.59** | **59.54** |
+| Mistral-7B-Instruct-v0.3 | 43.05 | 35.67 | 53.59 | 63.4 | 43.85 | 27.11 | 30.41 | 64.03 | 34.08 | 60.25 | 45.27 | 32.3 | 26.65 |
+| Mistral-Nemo-Instruct-2407 | 53.79 | 49.28 | 68.43 | 71.78 | 57.61 | 40.0 | 35.82 | 70.58 | 47.49 | 76.92 | 55.97 | 46.15 | 25.44 |
+| Mistral-Small-Instruct-2409 | 51.11 | 40.96 | 60.64 | 63.66 | 47.73 | 44.46 | 51.12 | 78.16 | 38.73 | 68.93 | 50.43 | 39.63 | 28.82 |
+| Falcon3-7B-Instruct | 41.3 | 37.52 | 52.65 | 57.63 | 41.47 | 56.53 | 8.58 | 47.92 | 31.84 | 58.98 | 42.08 | 32.36 | 27.99 |
+| Meta-Llama-3.1-8B-Instruct | 54.08 | 45.68 | 59.23 | 71.7 | 52.51 | 34.38 | 51.87 | 79.11 | 52.51 | 69.93 | 56.43 | 44.67 | 30.9 |
+| Llama-3.3-70B-Instruct | 71.43 | 68.84 | 79.6 | 78.81 | 70.49 | 70.91 | **70.9** | **88.6** | **65.74** | 76.93 | 72.01 | 70.25 | 44.12 |
+
+Closed models evaluations:
+
+| Model | ETEC<br>0 shot | IEN-MCQ<br>0 shot | IEN-TF<br>0 shot | AraPro<br>0 shot | AraMath<br>5 shot | Ar-IFEval<br>(prompt strict)<br>0 shot | Ar-IFEval<br>(inst strict)<br>0 shot | ExamsAR<br>5 shot | ACVA<br>5 shot | Arabic MMLU<br>0 shot | OpenAI MMLU<br>0 shot | GAT<br>0 shot |
+|:---------------------------------------|:--------------|:-----------------|:----------------|:----------------|:-----------------|:----------------------------------|:--------------------------------|:-----------------|:-----------------------|:--------------------|:---------------------|:----------------------|
+| Azureml GPT4o (gpt-4o-900ptu) | 79.39 | **92.03** | 88.97 | 80.86 | 83.47 | 70.9 | 88.12 | 61.82 | 72.51 | 79.02 | **76.5** | 62.65 |
+| Claude Sonnet 3.5 (claude-3-5-sonnet-20241022) | **85.9** | 86.17 | **89.42** | **81.46** | 79.83 | 53.73 | 80.14 | **62.38** | **80.42** | 69.5 | 66.4 | **68.89** |
+| gemini pro 1.5 (gemini-1.5-pro) | 83.31 | 88.28 | 85.44 | 76.22 | **94.88** | **74.81** | **90.17** | 58.1 | 75.17 | **82.0** | 64.8 | 59.14 |
+
+#### English Benchmarks
+
+| Model | Avg | AGIEval 0 Shot | Arc (challenge) 0 Shot | GPQA (main) 0 Shot | Hendrycks<br>ethics 0 Shot | Winogrande 0 Shot | HellaSwag 0 Shot | TriviaQA 5 Shot | MMLU Pro<br>5 Shot | Minerva Math<br>4 Shot | MMLU 0 Shot | TruthfulQA<br>(mc2) 0 Shot | IFEval<br>(prompt strict)<br>0 Shot | IFEval<br>(inst strict)<br>0 Shot | GSM8k 5 Shot |
+|:----------------------------------|:----------|:-----------------|:-----------------------|:--------------------------|:--------------------------|:--------------------|:-------------------|:------------------|:------------------|:----------------------|:--------------|:------------------------|:---------------------------------|:-------------------------------|:---------------|
+| ALLaM-7B-Instruct-preview | 46.85 | 41.99 | 51.28 | 22.77 | 73.17 | 70.48 | 76.26 | 16.07 | 30.4 | 17.3 | 59.6 | 46.67 | 38.08 | 50.0 | 61.79 |
+| AceGPT-v2-8B-Chat | 49.51 | 37.17 | 53.5 | 25.67 | 68.14 | 73.72 | 79.21 | 67.65 | 37.38 | 17.58 | 64.62 | 55.2 | 23.48 | 32.97 | 56.86 |
+| AceGPT-v2-32B-Chat | 57.14 | 56.01 | 53.92 | 32.8125 | 66.23 | 79.16 | 83.29 | 69.45 | 45.89 | 32.8 | 74.03 | 59.18 | 27.54 | 40.89 | 78.7 |
+| jais-family-6p7b-chat | 38.33 | 30.56 | 44.62 | 23.21 | 65.7 | 62.43 | 72.05 | 29.74 | 23.3 | 2.56 | 49.62 | 40.99 | 14.05 | 23.5 | 54.36 |
+| jais-family-13b-chat | 42.62 | 30.31 | 47.87 | 25.89 | 65.91 | 65.04 | 75.0 | 35.82 | 24.4 | 19.1 | 51.91 | 40.57 | 19.41 | 30.82 | 64.59 |
+| jais-family-30b-16k-chat | 45.15 | 31.85 | 48.46 | 23.88 | 69.44 | 68.19 | 76.21 | 43.99 | 29.11 | 22.3 | 58.5 | 44.78 | 18.3 | 29.14 | 67.93 |
+| jais-family-30b-8k-chat | 47.59 | 36.65 | 48.38 | 21.88 | 69.28 | 70.32 | 78.55 | 46.67 | 28.7 | 26.44 | 57.46 | 49.49 | 22.92 | 37.05 | 72.48 |
+| jais-adapted-7b-chat | 44.91 | 32.9 | 52.65 | 23.88 | 55.32 | 71.74 | 79.39 | 63.89 | 24.38 | 15.34 | 52.36 | 41.12 | 22.0 | 35.73 | 58.07 |
+| jais-adapted-13b-chat | 47.7 | 36.49 | 54.18 | 26.34 | 65.73 | 69.77 | 80.86 | 58.48 | 26.29 | 21.34 | 55.66 | 42.27 | 24.95 | 36.57 | 68.84 |
+| jais-adapted-70b-chat | 53.49 | 39.96 | 59.56 | 20.98 | 70.8 | 77.27 | 84.06 | 68.64 | 37.25 | 27.72 | 65.23 | 44.49 | 31.61 | 44.0 | 77.26 |
+| Qwen2.5-7B-Instruct | 54.68 | 59.2 | 51.28 | 26.56 | 73.76 | 69.38 | 79.55 | 50.59 | 44.92 | 12.04 | 70.56 | 58.93 | 57.3 | 68.23 | 43.29 |
+| Qwen2.5-14B-Instruct | 62.37 | 66.32 | 62.12 | 25.89 | 76.19 | 75.77 | 84.36 | 59.47 | 52.44 | 23.04 | 78.93 | 69.01 | 52.13 | 64.03 | 83.47 |
+| Qwen2.5-72B-Instruct | **70.06** | **71.09** | **63.48** | 25.67 | 78.33 | 76.24 | **87.41** | 70.9 | **62.77** | **54.04** | **83.44** | **69.54** | 67.65 | 77.1 | **93.25** |
+| Mistral-7B-Instruct-v0.3 | 51.98 | 36.45 | 58.87 | 23.21 | 72.58 | 73.95 | 82.93 | 67.97 | 33.18 | 13.44 | 59.74 | 59.69 | 42.51 | 54.8 | 48.37 |
+| Mistral-Nemo-Instruct-2407 | 54.0 | 39.65 | 59.04 | 24.33 | 67.86 | 74.66 | 82.35 | 72.77 | 44.27 | 29.62 | 65.56 | 54.88 | 30.13 | 38.97 | 71.95 |
+| Mistral-Small-Instruct-2409 | 61.65 | 40.76 | 60.49 | 25.89 | 72.27 | 78.53 | 85.35 | 79.11 | 47.47 | 39.42 | 69.42 | 56.35 | 58.23 | 68.35 | 81.43 |
+| Falcon3-7B-Instruct | 58.04 | 43.84 | 59.47 | **33.71** | 70.39 | 70.09 | 78.43 | 51.98 | 46.73 | 30.76 | 68.14 | 55.53 | 56.01 | 68.59 | 78.92 |
+| Meta-Llama-3.1-8B-Instruct | 56.5 | 42.39 | 55.12 | 27.23 | 66.69 | 73.95 | 79.28 | 70.05 | 40.641622 | 34.26 | 67.96 | 54.05 | 44.36 | 58.51 | 76.5 |
+| Llama-3.3-70B-Instruct | 67.7 | 55.44 | 63.4 | 25.89 | **81.05** | **79.24** | 84.39 | **81.7** | 60.51 | 46.42 | 81.99 | 60.91 | **63.22** | **72.78** | 90.83 |
+
+### MT-bench
+
+**Multi-Turn Bench** (MT-Bench): A challenging multi-turn benchmark that uses GPT-4o as a judge. MT-Bench comprises 80 questions from 8 domains. Each question is presented to the model, and its responses are submitted to GPT-4o, which assigns a score to each response and reports the first and second turns separately.
+This dataset was also automatically translated into Arabic, then manually verified and culturally aligned.
+
+| Model | AR Average | AR Turn 1 | AR Turn 2 | EN Average | EN Turn 1 | EN Turn 2 |
+|---------------------|------------|-----------|-----------|------------|-----------|-----------|
+| ALLaM-7B-Instruct-preview | 5.9 | **6.93**| 4.88 | 6.5 | 7.49 | 5.15 |
+| AceGPT-v1.5-13B-Chat | 4.61 | 5.28 | 3.93 | 4.86 | 5.56 | 4.17 |
+| AceGPT-v2-32B-Chat | 5.43 | 6.61 | 4.26 | **6.5** | **7.41** | **5.58** |
+| jais-family-13b-chat | 4.89 | 5.37 | 4.41 | 4.77 | 5.57 | 3.97 |
+| jais-family-30b-16k-chat | 4.87 | 5.50 | 4.25 | 5.13 | 5.86 | 4.4 |
+| jais-adapted-70b-chat | 5.86 | 6.33 | **5.38** | 5.88 | 6.41 | 5.36 |
+
+## Citation
+
+If you found this work helpful or used any part of it, please include the following citation:
+
+```
+@inproceedings{bari2025allam,
+title={{ALL}aM: Large Language Models for Arabic and English},
+author={M Saiful Bari and Yazeed Alnumay and Norah A. Alzahrani and Nouf M. Alotaibi and Hisham Abdullah Alyahya and Sultan AlRashed and Faisal Abdulrahman Mirza and Shaykhah Z. Alsubaie and Hassan A. Alahmed and Ghadah Alabduljabbar and Raghad Alkhathran and Yousef Almushayqih and Raneem Alnajim and Salman Alsubaihi and Maryam Al Mansour and Saad Amin Hassan and Dr. Majed Alrubaian and Ali Alammari and Zaki Alawami and Abdulmohsen Al-Thubaity and Ahmed Abdelali and Jeril Kuriakose and Abdalghani Abujabal and Nora Al-Twairesh and Areeb Alowisheq and Haidar Khan},
+booktitle={The Thirteenth International Conference on Learning Representations},
+year={2025},
+url={https://openreview.net/forum?id=MscdsFVZrN}
+}
+```
diff --git a/config.json b/config.json
new file mode 100644
index 0000000000000000000000000000000000000000..4763b112505829c69734a4ba8a0d34e72566f48c
--- /dev/null
+++ b/config.json
@@ -0,0 +1,28 @@
+{
+ "architectures": [
+ "LlamaForCausalLM"
+ ],
+ "attention_bias": false,
+ "attention_dropout": 0.0,
+ "bos_token_id": 1,
+ "eos_token_id": 2,
+ "hidden_act": "silu",
+ "hidden_size": 4096,
+ "initializer_range": 0.006,
+ "intermediate_size": 11008,
+ "max_position_embeddings": 4096,
+ "model_type": "llama",
+ "num_attention_heads": 32,
+ "num_hidden_layers": 32,
+ "num_key_value_heads": 32,
+ "pretraining_tp": 1,
+ "rms_norm_eps": 1e-05,
+ "rope_scaling": null,
+ "rope_theta": 10000.0,
+ "tie_word_embeddings": false,
+ "torch_dtype": "bfloat16",
+ "transformers_version": "4.39.3",
+ "use_cache": true,
+ "vocab_size": 64000,
+ "internal_version": "7b-alpha-v1.27.2.25"
+}
diff --git a/evaluations/ar/AceGPT-v2-32B-Chat/acva_5_shot.json b/evaluations/ar/AceGPT-v2-32B-Chat/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..51af3c245e96df028ef21a5fc4194d8c3cc8f346
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-32B-Chat/acva_5_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7274397244546499,
+ "acc_stderr,none": 0.004771397968508457,
+ "acc_norm,none": 0.7157290470723306,
+ "acc_norm_stderr,none": 0.004833440968499389
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 1.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737779797.3395095,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 26647.534977248,
+ "end_time": 27360.084961217,
+ "total_evaluation_time_seconds": "712.5499839689983"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-32B-Chat/ar_ifeval_0_shot.json b/evaluations/ar/AceGPT-v2-32B-Chat/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6119b6233907a85e1e74fc7b111b5c6cec0adab3
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-32B-Chat/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.2574626865671642,
+ "prompt_level_strict_acc_stderr,none": 0.018903377119672635,
+ "inst_level_strict_acc,none": 0.6341296928327645,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.31529850746268656,
+ "prompt_level_loose_acc_stderr,none": 0.020087907677710036,
+ "inst_level_loose_acc,none": 0.6764505119453925,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738794647.2071357,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "ar_ifeval": "d0b91e989c8b697090db63bf498d8e2d8dd80815a595e5f22845a8425bff22fa"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
+ "start_time": 1753623.131321269,
+ "end_time": 1761093.682009075,
+ "total_evaluation_time_seconds": "7470.550687805982"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-32B-Chat/araMath_v3_5_shot.json b/evaluations/ar/AceGPT-v2-32B-Chat/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..112b376a132dc045dcffa04c951bc58b01e968dc
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-32B-Chat/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.6446280991735537,
+ "acc_stderr,none": 0.019475010007284948,
+ "acc_norm,none": 0.6446280991735537,
+ "acc_norm_stderr,none": 0.019475010007284948
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738805225.8162587,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "araMath_v3": "17b2596f46d709ea107ed20bef044ca126de23a8e9bbc8ba0a9beef94fbc032d"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
+ "start_time": 1764201.606664753,
+ "end_time": 1764270.091855178,
+ "total_evaluation_time_seconds": "68.48519042483531"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-32B-Chat/araPro_0_shot.json b/evaluations/ar/AceGPT-v2-32B-Chat/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..9a82d840ff33da2cfff7bcb4dacd30f70e443d64
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-32B-Chat/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.671865626874625,
+ "acc_stderr,none": 0.006640213946839424,
+ "acc_norm,none": 0.671865626874625,
+ "acc_norm_stderr,none": 0.006640213946839424
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738802810.5474553,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "araPro": "2f706897ad0129e016cc8d6907f8bb4359c32403fc2d1b0a4e78717f424793da"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
+ "start_time": 1761786.552693387,
+ "end_time": 1761894.218775138,
+ "total_evaluation_time_seconds": "107.66608175099827"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-32B-Chat/arabicmmlu_0_shot.json b/evaluations/ar/AceGPT-v2-32B-Chat/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a3e2104ac267de7aee1f831ffb863836fe192612
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-32B-Chat/arabicmmlu_0_shot.json
@@ -0,0 +1,2051 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.6830162573503978,
+ "acc_stderr,none": 0.0037666673237025995,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.698180815876516,
+ "acc_stderr,none": 0.0074113813583826975,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.5578947368421052,
+ "acc_stderr,none": 0.01802677701787401
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.7365269461077845,
+ "acc_stderr,none": 0.02414016899389538
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.6410256410256411,
+ "acc_stderr,none": 0.07781756136754926
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.5915492957746479,
+ "acc_stderr,none": 0.019460543090359293
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.7142857142857143,
+ "acc_stderr,none": 0.03178529710642749
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.7142857142857143,
+ "acc_stderr,none": 0.029344572500634363
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.6764705882352942,
+ "acc_stderr,none": 0.0465501041131961
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.8348348348348348,
+ "acc_stderr,none": 0.01175423146342287
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.7707006369426752,
+ "acc_stderr,none": 0.02376140487281449
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6877278250303767,
+ "acc_stderr,none": 0.010897190392354756,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.7990196078431373,
+ "acc_stderr,none": 0.01621193888965557
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.726027397260274,
+ "acc_stderr,none": 0.023376494233709237
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.441025641025641,
+ "acc_stderr,none": 0.025174048384000766
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.8148148148148148,
+ "acc_stderr,none": 0.07618086585254093
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.7301587301587301,
+ "acc_stderr,none": 0.028017279737180052
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.7210144927536232,
+ "acc_stderr,none": 0.008956944496736811,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.7506193228736582,
+ "acc_stderr,none": 0.012437943646387221
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.6574074074074074,
+ "acc_stderr,none": 0.016154773861994782
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.7441860465116279,
+ "acc_stderr,none": 0.03336605189761063
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.7777777777777778,
+ "acc_stderr,none": 0.0327648791455327
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.8,
+ "acc_stderr,none": 0.046499055497527676
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6726598173515982,
+ "acc_stderr,none": 0.007798259846846906,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.5057471264367817,
+ "acc_stderr,none": 0.053912824825556656
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.7111111111111111,
+ "acc_stderr,none": 0.023921418402752255
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.6040462427745664,
+ "acc_stderr,none": 0.015186858609050091
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.6059322033898306,
+ "acc_stderr,none": 0.03187598097180376
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.8160919540229885,
+ "acc_stderr,none": 0.04177540678018987
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.7132352941176471,
+ "acc_stderr,none": 0.02747227447323382
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5518672199170125,
+ "acc_stderr,none": 0.032100739315089555
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.7368421052631579,
+ "acc_stderr,none": 0.058843894144731304
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.8056737588652483,
+ "acc_stderr,none": 0.014912793524753134
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.6756756756756757,
+ "acc_stderr,none": 0.05478951716752587
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.6496350364963503,
+ "acc_stderr,none": 0.040909634620704266
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.03260773253630123
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.6451612903225806,
+ "acc_stderr,none": 0.008155612741868946,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.525195173882186,
+ "acc_stderr,none": 0.013308116628249263
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.7164750957854407,
+ "acc_stderr,none": 0.027951780795387696
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.5764705882352941,
+ "acc_stderr,none": 0.03100369860682665
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.8518518518518519,
+ "acc_stderr,none": 0.06966962541673782
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.8140495867768595,
+ "acc_stderr,none": 0.025061985980100218
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.7315789473684211,
+ "acc_stderr,none": 0.032233538609655936
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.684596577017115,
+ "acc_stderr,none": 0.023004906965559055
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.8988095238095238,
+ "acc_stderr,none": 0.01647711789379545
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.703125,
+ "acc_stderr,none": 0.05756159356351619
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.6830162573503978,
+ "acc_stderr,none": 0.0037666673237025995,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.698180815876516,
+ "acc_stderr,none": 0.0074113813583826975,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6877278250303767,
+ "acc_stderr,none": 0.010897190392354756,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.7210144927536232,
+ "acc_stderr,none": 0.008956944496736811,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6726598173515982,
+ "acc_stderr,none": 0.007798259846846906,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.6451612903225806,
+ "acc_stderr,none": 0.008155612741868946,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_arabic_language_(general)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_high_physics",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_middle_computer_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_prof_law",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_high_history",
+ "arabicmmlu_middle_islamic_studies"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_high_civics",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_univ_political_science"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_univ_management",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_driving_test"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 4
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737779092.1744986,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 25942.251738535,
+ "end_time": 26447.764031496,
+ "total_evaluation_time_seconds": "505.51229296100064"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-32B-Chat/etec_v2_0_shot.json b/evaluations/ar/AceGPT-v2-32B-Chat/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..e18b7a2ab40940e3f0bd607d620c4e42a7828632
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-32B-Chat/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.6481187069422364,
+ "acc_stderr,none": 0.010996501146375258,
+ "acc_norm,none": 0.6481187069422364,
+ "acc_norm_stderr,none": 0.010996501146375258
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738805984.3189015,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "etec_v2": "697b8bfc7d6b0f85165e5cca6953182b09b7a2b0d79fa31e74cc3897f432de41"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
+ "start_time": 1764960.166542801,
+ "end_time": 1765035.801506021,
+ "total_evaluation_time_seconds": "75.63496321998537"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-32B-Chat/exams_ar_5_shot.json b/evaluations/ar/AceGPT-v2-32B-Chat/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..eae6472dddfebb62b63bc61c8de9c12b5f56b271
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-32B-Chat/exams_ar_5_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.553072625698324,
+ "acc_stderr,none": 0.021474702941383872,
+ "acc_norm,none": 0.553072625698324,
+ "acc_norm_stderr,none": 0.021474702941383872
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737780545.20475,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 27395.295045238,
+ "end_time": 27506.949709817,
+ "total_evaluation_time_seconds": "111.65466457900038"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-32B-Chat/gat_0_shot.json b/evaluations/ar/AceGPT-v2-32B-Chat/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..e57a59c502c47c1882e36df658b16e30b8c0e53f
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-32B-Chat/gat_0_shot.json
@@ -0,0 +1,543 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.4321459927254484,
+ "acc_stderr,none": 0.0038347299693873033,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.3992578849721707,
+ "acc_stderr,none": 0.009435653731651068
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.2867030965391621,
+ "acc_stderr,none": 0.00863295163043938
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.3894000736105999,
+ "acc_stderr,none": 0.009356458715331561
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.4143540669856459,
+ "acc_stderr,none": 0.01524590184737997
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.34672131147540985,
+ "acc_stderr,none": 0.013631312083187472
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.5793388429752067,
+ "acc_stderr,none": 0.014197745251253151
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.522239263803681,
+ "acc_stderr,none": 0.013837823280527494
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.5013698630136987,
+ "acc_stderr,none": 0.026207022561245137
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.585633270321361,
+ "acc_stderr,none": 0.009580200187530542
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.4321459927254484,
+ "acc_stderr,none": 0.0038347299693873033,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "ef4b2026",
+ "date": 1733932681.9722512,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.47.0",
+ "upper_git_hash": "27ba526c4b16ee30604687f8bfd4c19680101dd1",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 2367.995520754,
+ "end_time": 5482.980996963,
+ "total_evaluation_time_seconds": "3114.9854762089994"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_mcq_0_shot.json b/evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..e56e9d5442a16eb2ae094a29034403c990837e58
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.816016016016016,
+ "acc_stderr,none": 0.0038768441643790346,
+ "acc_norm,none": 0.816016016016016,
+ "acc_norm_stderr,none": 0.0038768441643790346
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738807582.4110897,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_mcq": "e5422ff2f277b9bfffeb1b5ad185b714804b5a3d276dfff99a29eb88d9a41683"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
+ "start_time": 1766558.431540363,
+ "end_time": 1766704.504224634,
+ "total_evaluation_time_seconds": "146.07268427102827"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_tf_0_shot.json b/evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6ac4bc7528abb57640622bd42de52c0651b70f9e
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.8035376953460416,
+ "acc_stderr,none": 0.005207228603848848,
+ "acc_norm,none": 0.8035376953460416,
+ "acc_norm_stderr,none": 0.005207228603848848
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738809377.2163908,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_tf": "116cb28cd11c72b01c3d52d75d3918c312d0a4f569bfdb8b2219398ec576a3f4"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
+ "start_time": 1768353.06839988,
+ "end_time": 1768502.097875321,
+ "total_evaluation_time_seconds": "149.0294754409697"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-32B-Chat/openaimmlu_0_shot.json b/evaluations/ar/AceGPT-v2-32B-Chat/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..39bb8d3c5186397b52d858e8f1a59963f429535d
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-32B-Chat/openaimmlu_0_shot.json
@@ -0,0 +1,2660 @@
+{
+ "results": {
+ "openaimmlu": {
+ "acc,none": 0.608033043725965,
+ "acc_stderr,none": 0.003975835153459076,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.5516556291390728,
+ "acc_stderr,none": 0.008782384894291078,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.37,
+ "acc_stderr,none": 0.04852365870939099
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.7171052631578947,
+ "acc_stderr,none": 0.03665349695640767
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.6597222222222222,
+ "acc_stderr,none": 0.03962135573486219
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.45,
+ "acc_stderr,none": 0.05
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.43,
+ "acc_stderr,none": 0.049756985195624284
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.35,
+ "acc_stderr,none": 0.0479372485441102
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.3627450980392157,
+ "acc_stderr,none": 0.047840607041056527
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.69,
+ "acc_stderr,none": 0.04648231987117316
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.6468085106382979,
+ "acc_stderr,none": 0.031245325202761926
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.43859649122807015,
+ "acc_stderr,none": 0.04668000738510455
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.5379310344827586,
+ "acc_stderr,none": 0.041546596717075474
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.5608465608465608,
+ "acc_stderr,none": 0.02555992055053101
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.7516129032258064,
+ "acc_stderr,none": 0.024580028921481003
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.49261083743842365,
+ "acc_stderr,none": 0.03517603540361008
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.72,
+ "acc_stderr,none": 0.04512608598542127
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.4148148148148148,
+ "acc_stderr,none": 0.030039842454069293
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.44370860927152317,
+ "acc_stderr,none": 0.04056527902281732
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.5555555555555556,
+ "acc_stderr,none": 0.03388857118502325
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.6978935698447893,
+ "acc_stderr,none": 0.010692790487345947,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.7878787878787878,
+ "acc_stderr,none": 0.03192271569548299
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.7794117647058824,
+ "acc_stderr,none": 0.02910225438967409
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.7974683544303798,
+ "acc_stderr,none": 0.02616056824660146
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.7603305785123967,
+ "acc_stderr,none": 0.03896878985070417
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.7129629629629629,
+ "acc_stderr,none": 0.043733130409147614
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.6625766871165644,
+ "acc_stderr,none": 0.03714908409935574
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.6205787781350482,
+ "acc_stderr,none": 0.027559949802347824
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.6172839506172839,
+ "acc_stderr,none": 0.027044538138402616
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.6491228070175439,
+ "acc_stderr,none": 0.03660298834049164
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.587491571139582,
+ "acc_stderr,none": 0.00615652758733159,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.4740740740740741,
+ "acc_stderr,none": 0.04313531696750574
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.630188679245283,
+ "acc_stderr,none": 0.029711421880107936
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.5953757225433526,
+ "acc_stderr,none": 0.03742461193887249
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.5079365079365079,
+ "acc_stderr,none": 0.044715725362943486
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.38,
+ "acc_stderr,none": 0.048783173121456316
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.7828282828282829,
+ "acc_stderr,none": 0.02937661648494563
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.7889908256880734,
+ "acc_stderr,none": 0.01749392240411265
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.6502242152466368,
+ "acc_stderr,none": 0.03200736719484503
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.375,
+ "acc_stderr,none": 0.04595091388086298
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.55,
+ "acc_stderr,none": 0.049999999999999996
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.7726692209450831,
+ "acc_stderr,none": 0.014987270640946024
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.6830065359477124,
+ "acc_stderr,none": 0.026643278474508755
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.450354609929078,
+ "acc_stderr,none": 0.02968010556502904
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.45371577574967403,
+ "acc_stderr,none": 0.01271540484127774
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.5441176470588235,
+ "acc_stderr,none": 0.030254372573976725
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.5816993464052288,
+ "acc_stderr,none": 0.019955975145835542
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.4879518072289157,
+ "acc_stderr,none": 0.03891364495835821
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.6475958612294583,
+ "acc_stderr,none": 0.008094925999116912,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.71,
+ "acc_stderr,none": 0.04560480215720684
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.7979274611398963,
+ "acc_stderr,none": 0.02897908979429673
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.6538461538461539,
+ "acc_stderr,none": 0.024121125416941187
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.7016806722689075,
+ "acc_stderr,none": 0.02971914287634285
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.6946564885496184,
+ "acc_stderr,none": 0.04039314978724561
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.7669902912621359,
+ "acc_stderr,none": 0.04185832598928315
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.8034188034188035,
+ "acc_stderr,none": 0.02603538609895129
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.5838150289017341,
+ "acc_stderr,none": 0.026538189104705488
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.4860335195530726,
+ "acc_stderr,none": 0.016715976410744522
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.6454545454545455,
+ "acc_stderr,none": 0.04582004841505415
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.7306122448979592,
+ "acc_stderr,none": 0.02840125202902294
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.7661691542288557,
+ "acc_stderr,none": 0.029929415408348387
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.82,
+ "acc_stderr,none": 0.03861229196653695
+ }
+ },
+ "groups": {
+ "openaimmlu": {
+ "acc,none": 0.608033043725965,
+ "acc_stderr,none": 0.003975835153459076,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.5516556291390728,
+ "acc_stderr,none": 0.008782384894291078,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.6978935698447893,
+ "acc_stderr,none": 0.010692790487345947,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.587491571139582,
+ "acc_stderr,none": 0.00615652758733159,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.6475958612294583,
+ "acc_stderr,none": 0.008094925999116912,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_international_law",
+ "openaimmlu_world_religions",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_philosophy",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_prehistory"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_business_ethics",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_security_studies",
+ "openaimmlu_marketing",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_management",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_sociology",
+ "openaimmlu_public_relations",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_moral_disputes"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_college_medicine",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_professional_law",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_formal_logic",
+ "openaimmlu_global_facts",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_virology",
+ "openaimmlu_machine_learning",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_nutrition",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_human_aging",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_anatomy"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_college_physics",
+ "openaimmlu_computer_security",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_econometrics",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_college_biology",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_astronomy",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_college_chemistry"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu": 0,
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735084516.9093957,
+ "pretty_env_info": "PyTorch version: 2.5.1+cu124\nIs debug build: False\nCUDA used to build PyTorch: 12.4\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.5.1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.5.1\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.20.1\n[pip3] triton==3.1.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "18b53334e0494773088a01c543e721a58f958e0d",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1864.708383469,
+ "end_time": 4236.282044429,
+ "total_evaluation_time_seconds": "2371.57366096"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-8B-Chat/acva_5_shot.json b/evaluations/ar/AceGPT-v2-8B-Chat/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..586ce37e8d9b07c8962dcb93caca59161161b777
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-8B-Chat/acva_5_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7415614236509759,
+ "acc_stderr,none": 0.004691028694524559,
+ "acc_norm,none": 0.7268656716417911,
+ "acc_norm_stderr,none": 0.004774534958083965
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736966813.484974,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 2430.929540314,
+ "end_time": 3025.204908665,
+ "total_evaluation_time_seconds": "594.275368351"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-8B-Chat/ar_ifeval_0_shot.json b/evaluations/ar/AceGPT-v2-8B-Chat/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..74543130f214fe652abe485d61df5a7230c2efeb
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-8B-Chat/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.10261194029850747,
+ "prompt_level_strict_acc_stderr,none": 0.01311934649092474,
+ "inst_level_strict_acc,none": 0.3924914675767918,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.12126865671641791,
+ "prompt_level_loose_acc_stderr,none": 0.01411319854290401,
+ "inst_level_loose_acc,none": 0.42389078498293514,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739784109.8369951,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {
+ "ar_ifeval": "9ce88f26b4b78e684512ecd933af67fe512192f41e27d2bedc62f288943db360"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 62023.729831301,
+ "end_time": 66967.714743853,
+ "total_evaluation_time_seconds": "4943.98491255199"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-8B-Chat/araMath_v3_5_shot.json b/evaluations/ar/AceGPT-v2-8B-Chat/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..e84ce922d21f7d94450285b1eec7b64a2b4b3bdf
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-8B-Chat/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.41487603305785126,
+ "acc_stderr,none": 0.02004770429343817,
+ "acc_norm,none": 0.41487603305785126,
+ "acc_norm_stderr,none": 0.02004770429343817
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739784015.8084505,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {
+ "araMath_v3": "4eebd1da6e6937fc09bb9f1871adb53192dbce96733f0f8ee76d406c2fc8cad5"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 61929.69246185,
+ "end_time": 61980.464828513,
+ "total_evaluation_time_seconds": "50.772366663004505"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-8B-Chat/araPro_0_shot.json b/evaluations/ar/AceGPT-v2-8B-Chat/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..36e776e6abe039d5ec46c957dc69a556c04e6a5c
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-8B-Chat/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.6350729854029195,
+ "acc_stderr,none": 0.006808161111700288,
+ "acc_norm,none": 0.6350729854029195,
+ "acc_norm_stderr,none": 0.006808161111700288
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739782427.4652286,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {
+ "araPro": "655c2f6626c4b10533bba45ff63f9d4501694dea7f65d0bb251390819154f901"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 60341.23142254,
+ "end_time": 60939.383586887,
+ "total_evaluation_time_seconds": "598.1521643470041"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-8B-Chat/arabicmmlu_0_shot.json b/evaluations/ar/AceGPT-v2-8B-Chat/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a8675c2c51e08b6e7742eaf3bdc8dda01903b458
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-8B-Chat/arabicmmlu_0_shot.json
@@ -0,0 +1,2045 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.5701833275683155,
+ "acc_stderr,none": 0.004022804239111275,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5986769570011026,
+ "acc_stderr,none": 0.007913780660392408,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.4473684210526316,
+ "acc_stderr,none": 0.018048022490206213
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.6167664670658682,
+ "acc_stderr,none": 0.026642195538092498
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.6410256410256411,
+ "acc_stderr,none": 0.07781756136754925
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.4788732394366197,
+ "acc_stderr,none": 0.019777510897112938
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.625615763546798,
+ "acc_stderr,none": 0.03405155380561952
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.6512605042016807,
+ "acc_stderr,none": 0.03095663632856655
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.5686274509803921,
+ "acc_stderr,none": 0.04928099597287534
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.7267267267267268,
+ "acc_stderr,none": 0.014106487065973254
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.7292993630573248,
+ "acc_stderr,none": 0.025114549205469412
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5364520048602673,
+ "acc_stderr,none": 0.012108801239884191,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.6062091503267973,
+ "acc_stderr,none": 0.019766211991073063
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.5561643835616439,
+ "acc_stderr,none": 0.026041258579497174
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.37948717948717947,
+ "acc_stderr,none": 0.024603626924097424
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.6296296296296297,
+ "acc_stderr,none": 0.09470524295495535
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.5714285714285714,
+ "acc_stderr,none": 0.031236022160528717
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6260064412238325,
+ "acc_stderr,none": 0.009658814860868633,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.6672171758876961,
+ "acc_stderr,none": 0.013546321390449019
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.5567129629629629,
+ "acc_stderr,none": 0.016910357335226688
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.6046511627906976,
+ "acc_stderr,none": 0.037389066648335266
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.6790123456790124,
+ "acc_stderr,none": 0.03679341185411387
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.6933333333333334,
+ "acc_stderr,none": 0.053602922245650664
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.5630707762557078,
+ "acc_stderr,none": 0.00827055654190365,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.4367816091954023,
+ "acc_stderr,none": 0.053483689652870973
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.5694444444444444,
+ "acc_stderr,none": 0.026133227823568903
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.4951830443159923,
+ "acc_stderr,none": 0.01552603179799726
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.4830508474576271,
+ "acc_stderr,none": 0.03259765859155327
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.8045977011494253,
+ "acc_stderr,none": 0.042756781109738705
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.5882352941176471,
+ "acc_stderr,none": 0.029896163033125478
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5062240663900415,
+ "acc_stderr,none": 0.032272360529663036
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.6140350877192983,
+ "acc_stderr,none": 0.06505437269382161
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.6879432624113475,
+ "acc_stderr,none": 0.017462513832971892
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.5405405405405406,
+ "acc_stderr,none": 0.05832789513012364
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.5255474452554745,
+ "acc_stderr,none": 0.04281864355155348
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.5619047619047619,
+ "acc_stderr,none": 0.0343196207118653
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5195740682743502,
+ "acc_stderr,none": 0.008544528678702652,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.41660752306600424,
+ "acc_stderr,none": 0.013138404810302533
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.5938697318007663,
+ "acc_stderr,none": 0.030457313978978034
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.3803921568627451,
+ "acc_stderr,none": 0.030461926918286298
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.09245003270420485
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.6611570247933884,
+ "acc_stderr,none": 0.030488989466217694
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.6578947368421053,
+ "acc_stderr,none": 0.03450858738901066
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.5158924205378973,
+ "acc_stderr,none": 0.024741181384437986
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.7767857142857143,
+ "acc_stderr,none": 0.022750408778833362
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.703125,
+ "acc_stderr,none": 0.05756159356351619
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.5701833275683155,
+ "acc_stderr,none": 0.004022804239111275,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5986769570011026,
+ "acc_stderr,none": 0.007913780660392408,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5364520048602673,
+ "acc_stderr,none": 0.012108801239884191,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6260064412238325,
+ "acc_stderr,none": 0.009658814860868633,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.5630707762557078,
+ "acc_stderr,none": 0.00827055654190365,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5195740682743502,
+ "acc_stderr,none": 0.008544528678702652,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_arabic_language_(general)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_univ_computer_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_primary_history",
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_high_history",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_islamic_studies"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_middle_civics"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_univ_management",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_general_knowledge"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735750331.498813,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 10616.839471692,
+ "end_time": 11074.169545653,
+ "total_evaluation_time_seconds": "457.3300739610004"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-8B-Chat/etec_v2_0_shot.json b/evaluations/ar/AceGPT-v2-8B-Chat/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..83d2d44c8a65298a00ead012e06f751ba66d6302
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-8B-Chat/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.5680975092739798,
+ "acc_stderr,none": 0.011406002243769559,
+ "acc_norm,none": 0.5680975092739798,
+ "acc_norm_stderr,none": 0.011406002243769559
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739783073.791851,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {
+ "etec_v2": "d371135bd6f3e91b2eb292576c3b2fae24dc4c0d7cd2a5f6eacf1fe6bc062e76"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 60987.772646854,
+ "end_time": 61072.230445773,
+ "total_evaluation_time_seconds": "84.4577989190002"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-8B-Chat/exams_ar_5_shot.json b/evaluations/ar/AceGPT-v2-8B-Chat/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..9b33ee4f39033580d7df247b4d5f1f5de7485f35
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-8B-Chat/exams_ar_5_shot.json
@@ -0,0 +1,119 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.5195530726256983,
+ "acc_stderr,none": 0.02158019049784565,
+ "acc_norm,none": 0.5195530726256983,
+ "acc_norm_stderr,none": 0.02158019049784565
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 0.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735747770.5687191,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 8055.848670643,
+ "end_time": 8272.25518881,
+ "total_evaluation_time_seconds": "216.40651816700029"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-8B-Chat/gat_0_shot.json b/evaluations/ar/AceGPT-v2-8B-Chat/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..8afd5c40ea7001636c3d685211615d041870c93e
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-8B-Chat/gat_0_shot.json
@@ -0,0 +1,539 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.3615326727706008,
+ "acc_stderr,none": 0.003748588350676633,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.30241187384044527,
+ "acc_stderr,none": 0.008849121616191958
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.3227686703096539,
+ "acc_stderr,none": 0.008925286248200312
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.3213102686786897,
+ "acc_stderr,none": 0.008960516811645579
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.39425837320574164,
+ "acc_stderr,none": 0.01512460088966808
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.28114754098360656,
+ "acc_stderr,none": 0.012876124676937594
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.46115702479338844,
+ "acc_stderr,none": 0.014336474830596175
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.2983128834355828,
+ "acc_stderr,none": 0.012674637536976358
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.3232876712328767,
+ "acc_stderr,none": 0.024515791774351408
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.5183364839319471,
+ "acc_stderr,none": 0.009717331969425425
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.3615326727706008,
+ "acc_stderr,none": 0.003748588350676633,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735749781.6371627,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 10066.91226392,
+ "end_time": 10586.891967311,
+ "total_evaluation_time_seconds": "519.9797033909999"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_mcq_0_shot.json b/evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..0ca9bc1f3269c41d2c3321129cfe678856edb540
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.7700700700700701,
+ "acc_stderr,none": 0.0042101916833611345,
+ "acc_norm,none": 0.7700700700700701,
+ "acc_norm_stderr,none": 0.0042101916833611345
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739783202.062394,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {
+ "moe_ien_mcq": "99731f9d1bb76d010da5a439ea1b0bb7695451459d680f708f7222f02ba8e831"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 61116.014324615,
+ "end_time": 61463.567260828,
+ "total_evaluation_time_seconds": "347.5529362130037"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_tf_0_shot.json b/evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..90ea7d41693648e62e021ddbabbc63664816c431
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.7590589043448395,
+ "acc_stderr,none": 0.00560476076159517,
+ "acc_norm,none": 0.7590589043448395,
+ "acc_norm_stderr,none": 0.00560476076159517
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739783594.7150183,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {
+ "moe_ien_tf": "a8315c59ec304a82f04395ff5e7728d6586b1b0b5f569486840b7d29d76a8dd8"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 61508.598662402,
+ "end_time": 61883.458017876,
+ "total_evaluation_time_seconds": "374.85935547400004"
+}
\ No newline at end of file
diff --git a/evaluations/ar/AceGPT-v2-8B-Chat/openaimmlu_0_shot.json b/evaluations/ar/AceGPT-v2-8B-Chat/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..105f50ae826afd8c1d944ccd2328f35c1e50d5d4
--- /dev/null
+++ b/evaluations/ar/AceGPT-v2-8B-Chat/openaimmlu_0_shot.json
@@ -0,0 +1,2662 @@
+{
+ "results": {
+ "openaimmlu": {
+ "acc,none": 0.49992878507335137,
+ "acc_stderr,none": 0.004078575700822945,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.41456953642384103,
+ "acc_stderr,none": 0.008797147564007037,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.42,
+ "acc_stderr,none": 0.049604496374885836
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.5394736842105263,
+ "acc_stderr,none": 0.04056242252249034
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.5069444444444444,
+ "acc_stderr,none": 0.04180806750294938
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.38,
+ "acc_stderr,none": 0.048783173121456316
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.34,
+ "acc_stderr,none": 0.04760952285695235
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.27,
+ "acc_stderr,none": 0.044619604333847394
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.23529411764705882,
+ "acc_stderr,none": 0.042207736591714534
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.6,
+ "acc_stderr,none": 0.04923659639173309
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.44680851063829785,
+ "acc_stderr,none": 0.0325005368436584
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.35964912280701755,
+ "acc_stderr,none": 0.04514496132873633
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.4482758620689655,
+ "acc_stderr,none": 0.04144311810878151
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.3544973544973545,
+ "acc_stderr,none": 0.024636830602842
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.5774193548387097,
+ "acc_stderr,none": 0.02810096472427264
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.3891625615763547,
+ "acc_stderr,none": 0.03430462416103872
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.59,
+ "acc_stderr,none": 0.04943110704237101
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.3296296296296296,
+ "acc_stderr,none": 0.02866120111652458
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.3509933774834437,
+ "acc_stderr,none": 0.03896981964257375
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.3148148148148148,
+ "acc_stderr,none": 0.03167468706828979
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.6058758314855875,
+ "acc_stderr,none": 0.011278032493102804,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.7393939393939394,
+ "acc_stderr,none": 0.03427743175816524
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.6911764705882353,
+ "acc_stderr,none": 0.03242661719827218
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.7341772151898734,
+ "acc_stderr,none": 0.028756799629658332
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.6776859504132231,
+ "acc_stderr,none": 0.042664163633521685
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.6388888888888888,
+ "acc_stderr,none": 0.04643454608906275
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.5766871165644172,
+ "acc_stderr,none": 0.03881891213334384
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.5112540192926045,
+ "acc_stderr,none": 0.028390897396863533
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.45987654320987653,
+ "acc_stderr,none": 0.02773102275353927
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.6023391812865497,
+ "acc_stderr,none": 0.03753638955761691
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.49730276466621715,
+ "acc_stderr,none": 0.006341766264221109,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.45925925925925926,
+ "acc_stderr,none": 0.04304979692464243
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.5471698113207547,
+ "acc_stderr,none": 0.030635627957961816
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.4624277456647399,
+ "acc_stderr,none": 0.0380168510452446
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.4126984126984127,
+ "acc_stderr,none": 0.04403438954768177
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.37,
+ "acc_stderr,none": 0.048523658709390974
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.696969696969697,
+ "acc_stderr,none": 0.032742879140268674
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.655045871559633,
+ "acc_stderr,none": 0.020380605405066966
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.5650224215246636,
+ "acc_stderr,none": 0.033272833702713445
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.33035714285714285,
+ "acc_stderr,none": 0.04464285714285714
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.48,
+ "acc_stderr,none": 0.050211673156867795
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.6475095785440613,
+ "acc_stderr,none": 0.017084150244081376
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.565359477124183,
+ "acc_stderr,none": 0.028384256704883037
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.3723404255319149,
+ "acc_stderr,none": 0.02883892147125145
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.39048239895697523,
+ "acc_stderr,none": 0.012460135913945071
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.4375,
+ "acc_stderr,none": 0.030134614954403924
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.46895424836601307,
+ "acc_stderr,none": 0.02018880445636189
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.46987951807228917,
+ "acc_stderr,none": 0.03885425420866766
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.5249543517954961,
+ "acc_stderr,none": 0.008306273559742111,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.64,
+ "acc_stderr,none": 0.048241815132442176
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.6528497409326425,
+ "acc_stderr,none": 0.03435696168361355
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.5102564102564102,
+ "acc_stderr,none": 0.025345672221942374
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.5042016806722689,
+ "acc_stderr,none": 0.03247734334448111
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.6183206106870229,
+ "acc_stderr,none": 0.04260735157644561
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.6310679611650486,
+ "acc_stderr,none": 0.0477761518115674
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.7350427350427351,
+ "acc_stderr,none": 0.02891120880274948
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.5520231213872833,
+ "acc_stderr,none": 0.026772990653361833
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.3005586592178771,
+ "acc_stderr,none": 0.01533456680625117
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.6454545454545455,
+ "acc_stderr,none": 0.04582004841505417
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.6244897959183674,
+ "acc_stderr,none": 0.03100120903989484
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.6865671641791045,
+ "acc_stderr,none": 0.032801882053486435
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.76,
+ "acc_stderr,none": 0.04292346959909282
+ }
+ },
+ "groups": {
+ "openaimmlu": {
+ "acc,none": 0.49992878507335137,
+ "acc_stderr,none": 0.004078575700822945,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.41456953642384103,
+ "acc_stderr,none": 0.008797147564007037,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.6058758314855875,
+ "acc_stderr,none": 0.011278032493102804,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.49730276466621715,
+ "acc_stderr,none": 0.006341766264221109,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.5249543517954961,
+ "acc_stderr,none": 0.008306273559742111,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_philosophy",
+ "openaimmlu_world_religions",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_prehistory",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_international_law"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_management",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_public_relations",
+ "openaimmlu_security_studies",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_sociology",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_marketing",
+ "openaimmlu_business_ethics"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_medical_genetics",
+ "openaimmlu_anatomy",
+ "openaimmlu_virology",
+ "openaimmlu_global_facts",
+ "openaimmlu_nutrition",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_college_medicine",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_machine_learning",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_human_aging",
+ "openaimmlu_formal_logic",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_professional_law"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_college_physics",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_astronomy",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_econometrics",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_college_biology",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_computer_security",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_high_school_mathematics"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu": 0,
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736967434.1317873,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 3051.767455257,
+ "end_time": 3330.634011851,
+ "total_evaluation_time_seconds": "278.86655659400003"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Allam-7b-instruct-preview/acva_5_shot.json b/evaluations/ar/Allam-7b-instruct-preview/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..044e2973d7d0f17ef13d4ae709e9184e5356db72
--- /dev/null
+++ b/evaluations/ar/Allam-7b-instruct-preview/acva_5_shot.json
@@ -0,0 +1,119 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7746268656716417,
+ "acc_stderr,none": 0.004477269169728854,
+ "acc_norm,none": 0.7632606199770379,
+ "acc_norm_stderr,none": 0.004554991129754026
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735662713.7617116,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "acva": "d007c508f0accdd697f549d7cbe7f960f1470c8f86f1a0969355a6ef33108edb"
+ },
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 3374.021232778,
+ "end_time": 3578.563943596,
+ "total_evaluation_time_seconds": "204.54271081800016"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Allam-7b-instruct-preview/ar_ifeval_0_shot.json b/evaluations/ar/Allam-7b-instruct-preview/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6d33e9d2fbe764f2a5ebf1c08f62e88c7101ce33
--- /dev/null
+++ b/evaluations/ar/Allam-7b-instruct-preview/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.31343283582089554,
+ "prompt_level_strict_acc_stderr,none": 0.020055655889994813,
+ "inst_level_strict_acc,none": 0.6764505119453925,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.3656716417910448,
+ "prompt_level_loose_acc_stderr,none": 0.020822161638297296,
+ "inst_level_loose_acc,none": 0.7051194539249147,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618378.981141,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "ar_ifeval": "d0db7903ef270d7dc54efe4e7713be0de9864fc3a36c901c6e5777a6a5f69aa9"
+ },
+ "model_source": "hf",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<>\\n' + system_message + '\\n<>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
+ "start_time": 1393068.333905473,
+ "end_time": 1397143.169266589,
+ "total_evaluation_time_seconds": "4074.8353611161"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Allam-7b-instruct-preview/araMath_v3_5_shot.json b/evaluations/ar/Allam-7b-instruct-preview/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..d0c9401834dc98ff9bccbeaf4d4ca0df423c9609
--- /dev/null
+++ b/evaluations/ar/Allam-7b-instruct-preview/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.6677685950413224,
+ "acc_stderr,none": 0.019165266705090528,
+ "acc_norm,none": 0.6677685950413224,
+ "acc_norm_stderr,none": 0.019165266705090528
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618269.6292942,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "araMath_v3": "e7f60b63c44ee90c76a61f37207fa1f812622b6662200911fcfd7dabe78ada66"
+ },
+ "model_source": "hf",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<>\\n' + system_message + '\\n<>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
+ "start_time": 1392959.193182268,
+ "end_time": 1393012.133225703,
+ "total_evaluation_time_seconds": "52.940043434966356"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Allam-7b-instruct-preview/araPro_0_shot.json b/evaluations/ar/Allam-7b-instruct-preview/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..5b3d068872f9ef0968d9d9bab4b1651b09a379b0
--- /dev/null
+++ b/evaluations/ar/Allam-7b-instruct-preview/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.6970605878824235,
+ "acc_stderr,none": 0.006498724870364006,
+ "acc_norm,none": 0.6970605878824235,
+ "acc_norm_stderr,none": 0.006498724870364006
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617164.0204737,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "araPro": "01340c360a1565c46298c4c24dd3fdfe1ea614c6eef6e4d4f021f1da83da2584"
+ },
+ "model_source": "hf",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<>\\n' + system_message + '\\n<>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
+ "start_time": 1391853.516943726,
+ "end_time": 1392050.054185297,
+ "total_evaluation_time_seconds": "196.5372415711172"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Allam-7b-instruct-preview/arabicmmlu_0_shot.json b/evaluations/ar/Allam-7b-instruct-preview/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b83e2d22316e09f7fb9721e431a2cebc66f5ca41
--- /dev/null
+++ b/evaluations/ar/Allam-7b-instruct-preview/arabicmmlu_0_shot.json
@@ -0,0 +1,2086 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.6777585610515393,
+ "acc_stderr,none": 0.0037651094938210825,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.7196802646085998,
+ "acc_stderr,none": 0.007156852970625745,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.5039473684210526,
+ "acc_stderr,none": 0.01814828462669052
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.7485029940119761,
+ "acc_stderr,none": 0.023776124368602287
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.7435897435897436,
+ "acc_stderr,none": 0.07083413480167725
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.704225352112676,
+ "acc_stderr,none": 0.018068660651366884
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.7241379310344828,
+ "acc_stderr,none": 0.03144712581678242
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.7647058823529411,
+ "acc_stderr,none": 0.027553614467863807
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.7647058823529411,
+ "acc_stderr,none": 0.04220773659171455
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.8708708708708709,
+ "acc_stderr,none": 0.010615091024310195
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.7070063694267515,
+ "acc_stderr,none": 0.025725781937262132
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.7053462940461726,
+ "acc_stderr,none": 0.010675632352174308,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.8088235294117647,
+ "acc_stderr,none": 0.01590829013627805
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.7232876712328767,
+ "acc_stderr,none": 0.02344871747678411
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.45384615384615384,
+ "acc_stderr,none": 0.025242770987126177
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.8518518518518519,
+ "acc_stderr,none": 0.06966962541673782
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.8015873015873016,
+ "acc_stderr,none": 0.025172322396351483
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.7089371980676329,
+ "acc_stderr,none": 0.009115340366470213,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.6985962014863749,
+ "acc_stderr,none": 0.013191518335507111
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.7199074074074074,
+ "acc_stderr,none": 0.015285643798521893
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.6802325581395349,
+ "acc_stderr,none": 0.035665455380848116
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.7654320987654321,
+ "acc_stderr,none": 0.03339448023577033
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.6933333333333334,
+ "acc_stderr,none": 0.05360292224565066
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.641837899543379,
+ "acc_stderr,none": 0.00797908211240422,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.4827586206896552,
+ "acc_stderr,none": 0.05388432214060092
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.625,
+ "acc_stderr,none": 0.025551030374592384
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.5770712909441233,
+ "acc_stderr,none": 0.015341186146893518
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.5932203389830508,
+ "acc_stderr,none": 0.03204451480926517
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.7471264367816092,
+ "acc_stderr,none": 0.04687049503854671
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.7132352941176471,
+ "acc_stderr,none": 0.02747227447323382
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5767634854771784,
+ "acc_stderr,none": 0.03189222523446444
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.7719298245614035,
+ "acc_stderr,none": 0.05606981784761176
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.7815602836879433,
+ "acc_stderr,none": 0.015572585115281092
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.6351351351351351,
+ "acc_stderr,none": 0.05634270081349515
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.5693430656934306,
+ "acc_stderr,none": 0.04246032224326305
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.5952380952380952,
+ "acc_stderr,none": 0.03395252139627751
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.6310679611650486,
+ "acc_stderr,none": 0.008195409873199793,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.5095812633073101,
+ "acc_stderr,none": 0.013322598053209577
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.6934865900383141,
+ "acc_stderr,none": 0.02859282719866765
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.5176470588235295,
+ "acc_stderr,none": 0.031353244021767535
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.9259259259259259,
+ "acc_stderr,none": 0.051361129280113826
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.8016528925619835,
+ "acc_stderr,none": 0.02568606613318377
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.7473684210526316,
+ "acc_stderr,none": 0.031606782497111685
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.6772616136919315,
+ "acc_stderr,none": 0.023145867389961022
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.8839285714285714,
+ "acc_stderr,none": 0.017500435136664095
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.765625,
+ "acc_stderr,none": 0.053369535239372906
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.6777585610515393,
+ "acc_stderr,none": 0.0037651094938210825,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.7196802646085998,
+ "acc_stderr,none": 0.007156852970625745,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.7053462940461726,
+ "acc_stderr,none": 0.010675632352174308,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.7089371980676329,
+ "acc_stderr,none": 0.009115340366470213,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.641837899543379,
+ "acc_stderr,none": 0.00797908211240422,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.6310679611650486,
+ "acc_stderr,none": 0.008195409873199793,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_arabic_language_(general)",
+ "arabicmmlu_primary_arabic_language"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_primary_natural_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_prof_law",
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_primary_history"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_primary_social_science"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_univ_management",
+ "arabicmmlu_driving_test"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735662320.4500997,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "arabicmmlu_primary_general_knowledge": "9c41f9b2409e40ac46be285d8ef0c425c69f2e89f389af149388ed3317803f47",
+ "arabicmmlu_general_knowledge": "d0d398d26921bf02c874c7f6261b3b35569d2e5d4f5ff0b57c3849702ac76c7d",
+ "arabicmmlu_middle_general_knowledge": "01dc69e7e4349d3ad2d4c3a1aa9c3223aa6b80b49eb927328995d78a7119d12e",
+ "arabicmmlu_univ_management": "a75412840fc2690239048b87ff63c88576d098043214e33c0f893ae262adf558",
+ "arabicmmlu_driving_test": "1294a352f9996956b5eb556dfb4ad8da6c107cf83d78057e03423a1d263271eb",
+ "arabicmmlu_middle_social_science": "aaa200ab5bef99e627e5cc2339616fe893324ba9f0e6bc21b1cbf50fb12f87a4",
+ "arabicmmlu_univ_economics": "ec1e184a96e1c5fb9ebcf75c7a681987e10269f310970712fa7e08cf08aedf9c",
+ "arabicmmlu_univ_accounting": "e86c0c589105cd0a8799c9f9ed5d3be8fd66a372b0c276d841224253ac26caf3",
+ "arabicmmlu_high_civics": "1782368ed0854ebb92d306d63b5309220d9dbc812e759134bdb319a4798a9f4a",
+ "arabicmmlu_high_economics": "98ec2aac658625844ae7905b5bbb20e9b1d008e80237fac4562d269c98d95036",
+ "arabicmmlu_middle_geography": "11b273709d3739cd0ca0112960b7f80126185838d2573abf434f4d13b1b58a41",
+ "arabicmmlu_primary_geography": "280a1771b756a73d2e6ded00eecadbac20e4ee1ef00949a3b0825e9d997c6125",
+ "arabicmmlu_middle_civics": "ddbc97ff3f96ceaff0e296b6c9bf792f50d50f076200ca9a60bf72137508246d",
+ "arabicmmlu_high_geography": "faf4ba7fc6c07d9d395ab8b3cf1d3f62d2aa51297d1de2417503d99725ee5968",
+ "arabicmmlu_middle_economics": "411a71e9a0975e178836323da11af60b68483e80e6e50c16e8ab5a4399b15cf6",
+ "arabicmmlu_univ_political_science": "1b4e81c09070ed52587d966e92a753718fd6afc4f22b885a75aeca950f7bbc44",
+ "arabicmmlu_primary_social_science": "14b9797e030d4915891382e67f531aff407f495a0c95de390cb140415da4853e",
+ "arabicmmlu_prof_law": "929be8388dbe8a64e52db14f2d17ab627b51fa59718b97bab57d7f885ae22745",
+ "arabicmmlu_middle_islamic_studies": "212f989ad1b21aa4d465b9eac1f49cbc7885f57130768926cc6b44299bab862b",
+ "arabicmmlu_high_philosophy": "7918cb8aff5e2ce06d60f7b8a476db496f12f1c528a5c76dab4e1a7a3802615f",
+ "arabicmmlu_high_islamic_studies": "36c0092e41cc9b74cf95e7580a22cd3bc6c1c8be1b583aeef612303a644ee5d1",
+ "arabicmmlu_islamic_studies": "61441e32632d46ba8de49eb0db6c9424402d26c7cfd21cf80cad845f78162d25",
+ "arabicmmlu_high_history": "db21ec3b92313a8ff84eea1ef253bd9fd311b799b7255530752c9d9d42582e31",
+ "arabicmmlu_primary_islamic_studies": "948fda0d0bc5d6b7f3d4778361317c5f1ccd749e82071cec7710ebe034f8e5cf",
+ "arabicmmlu_middle_history": "06d1eee1e75a711e0f6e4b6209b1ddf2b7b9ac8fd4e9e19c83bc260664e9da92",
+ "arabicmmlu_primary_history": "236ef1dc7fe81ba7e3abf7f4c0f706e5cf1932692f6bb670df7fcdd8118843ee",
+ "arabicmmlu_high_computer_science": "b94390a6fd058297d59d43575ce189c833d75fd636894320989d8628b074f002",
+ "arabicmmlu_primary_math": "7fbd73f73bc85611f0495ed87530d6512d9da9e0c92fe25553a591b91ef4e79d",
+ "arabicmmlu_high_biology": "daeac852f0eb44834936f0a04bc71521d2b9d939d47e7976b80f1e576b7688c1",
+ "arabicmmlu_primary_computer_science": "bb40dbb3bf51122ea2a0cc30848e010b71de881a8b7a6b5f11e97c36867431e6",
+ "arabicmmlu_middle_natural_science": "5d3ab2bf4ca8633ecf28783ae2d05d0025d3af21add23eadd96cea54c63427cf",
+ "arabicmmlu_high_physics": "defccd1d721b1ba615956f253ad5f61f383b5f8a9d2aae786b58bbd212f87ec1",
+ "arabicmmlu_middle_computer_science": "6d88646a6979333723a7697392ef4bf8d9440001ebe886ca85f5461f3a510048",
+ "arabicmmlu_univ_computer_science": "1e38d7bfc8a18b04cc9e57e3ae4e3c11f4d4fc6f07321feba0d36a3122923d0b",
+ "arabicmmlu_primary_natural_science": "fac384e5d9b22d1c20239d6d2563d9f0a79fb48cf615204fcf229fc37c76a008",
+ "arabicmmlu_high_arabic_language": "f4771e89a45e43ae733dcfda251963f5de5383f783d5f534e4ce1999a67b6116",
+ "arabicmmlu_arabic_language_(grammar)": "17e3b209cf3c2d60d47089cdcfdd29f18f8af73b5b9ef05fe6207dfaa0d4c41b",
+ "arabicmmlu_middle_arabic_language": "3332b66219055daebf1b147ad8f648a3edcc672ef99feb2ded597ae8740a995c",
+ "arabicmmlu_arabic_language_(general)": "baa8d90299504f0ee7dd6b57071cf0502218545f926847cd2f30b92be8aeed8b",
+ "arabicmmlu_primary_arabic_language": "70a513c8c604cd2edb7ab15dea6e21908f1a4136dbd98e3a1294a7111dfa4228"
+ },
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 2980.642859002,
+ "end_time": 3340.273846829,
+ "total_evaluation_time_seconds": "359.6309878269999"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Allam-7b-instruct-preview/etec_v2_0_shot.json b/evaluations/ar/Allam-7b-instruct-preview/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..77ada3ba35a7fd76277a85514bb5349c4ba7ad88
--- /dev/null
+++ b/evaluations/ar/Allam-7b-instruct-preview/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.010854826817097195,
+ "acc_norm,none": 0.6666666666666666,
+ "acc_norm_stderr,none": 0.010854826817097195
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617421.4265695,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "etec_v2": "a0d87bf7eb82815b66ea544cb632aafb803526dee24b399f30fdc751be442b60"
+ },
+ "model_source": "hf",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<>\\n' + system_message + '\\n<>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
+ "start_time": 1392110.980523203,
+ "end_time": 1392198.883363127,
+ "total_evaluation_time_seconds": "87.90283992397599"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Allam-7b-instruct-preview/exams_ar_5_shot.json b/evaluations/ar/Allam-7b-instruct-preview/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..c1e3a26faff6c0ab953b7722a246547ea89d567f
--- /dev/null
+++ b/evaluations/ar/Allam-7b-instruct-preview/exams_ar_5_shot.json
@@ -0,0 +1,121 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.515828677839851,
+ "acc_stderr,none": 0.021585885942816244,
+ "acc_norm,none": 0.515828677839851,
+ "acc_norm_stderr,none": 0.021585885942816244
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 0.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735662207.0830526,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "exams_ar": "b1561abd56354d570ac16bf64163b0ee8dc6c507234b05f678576b09c26c644a"
+ },
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 2867.397536365,
+ "end_time": 2948.510496752,
+ "total_evaluation_time_seconds": "81.11296038699993"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Allam-7b-instruct-preview/gat_0_shot.json b/evaluations/ar/Allam-7b-instruct-preview/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..36484e86ead83bc0cd60c1ac58666b01c1fa7f5f
--- /dev/null
+++ b/evaluations/ar/Allam-7b-instruct-preview/gat_0_shot.json
@@ -0,0 +1,549 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.4452527279568544,
+ "acc_stderr,none": 0.0038711388833064567,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.40667903525046384,
+ "acc_stderr,none": 0.009463939247454995
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.35919854280510016,
+ "acc_stderr,none": 0.009158766245747282
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.40154582259845417,
+ "acc_stderr,none": 0.009406284814832203
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.5464114832535886,
+ "acc_stderr,none": 0.015407801869520031
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.34508196721311474,
+ "acc_stderr,none": 0.013616100682624904
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.6057851239669422,
+ "acc_stderr,none": 0.014054411207805699
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.3941717791411043,
+ "acc_stderr,none": 0.013537713096332765
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.473972602739726,
+ "acc_stderr,none": 0.026171590093068537
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.5727788279773157,
+ "acc_stderr,none": 0.009620311542503682
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.4452527279568544,
+ "acc_stderr,none": 0.0038711388833064567,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735664096.2650902,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "gat_analogy": "ede28dec097bfebe8a85a19fa27d001696858276df66254bdb70fc63231f1a83",
+ "gat_association": "5d82550d46c4f3cabf370185a8a23cc2eb5b08f1f0c5e210a8a712562a44bd08",
+ "gat_completion": "fc3c19dd7f1896696fec1bffc21182804c9b2f1fb8d8c882428a6bb4bb61e370",
+ "gat_reading": "93053b187a750d2e87f5488f2d0fda944f3da9195bb04d1c4dee9c4b56fa626a",
+ "gat_algebra": "77832c595eaaf156775c3dbb27da0915ef600ebf46a7113ae32a202b0359e8a6",
+ "gat_arithmetic": "6a498f75f5cc0ffd1b30f7a6293ba80d08f2a8876d5558d8e934bf57355ff0cc",
+ "gat_comparisons": "acb80c0ed8dd07e916a471189aef3a546efc289824b2cc50a32c11dc4c97c9c1",
+ "gat_contextual": "de063ed3b94011d74ee24a6532122c9d344fc15e42800db44f0849995a0bc37a",
+ "gat_geometry": "3e482885559a4404ee9e97556edc6e49959770a499f4ae2c58f18ad85b91a363"
+ },
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4756.376698655,
+ "end_time": 5124.76942052,
+ "total_evaluation_time_seconds": "368.39272186499966"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Allam-7b-instruct-preview/moe_ien_mcq_0_shot.json b/evaluations/ar/Allam-7b-instruct-preview/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..299df33240ce03b33c222b3a2e07dad3fce1b939
--- /dev/null
+++ b/evaluations/ar/Allam-7b-instruct-preview/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.9177177177177177,
+ "acc_stderr,none": 0.002749455634736978,
+ "acc_norm,none": 0.9177177177177177,
+ "acc_norm_stderr,none": 0.002749455634736978
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617571.8184838,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "moe_ien_mcq": "504533b140426f12c89d975ef421328fc89d69af8719c420a1bf897ed4724191"
+ },
+ "model_source": "hf",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<>\\n' + system_message + '\\n<>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
+ "start_time": 1392261.292633723,
+ "end_time": 1392626.942167409,
+ "total_evaluation_time_seconds": "365.64953368599527"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Allam-7b-instruct-preview/moe_ien_tf_0_shot.json b/evaluations/ar/Allam-7b-instruct-preview/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b3201ed5a057a79db5542687198ead1f0fc5d301
--- /dev/null
+++ b/evaluations/ar/Allam-7b-instruct-preview/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.8294693456980937,
+ "acc_stderr,none": 0.004929073554117403,
+ "acc_norm,none": 0.8294693456980937,
+ "acc_norm_stderr,none": 0.004929073554117403
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617995.3462336,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "moe_ien_tf": "8701a646f6ea8b9bb96c028f817fbeabfb9031580f5054368b43d14d4a5a1270"
+ },
+ "model_source": "hf",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<>\\n' + system_message + '\\n<>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
+ "start_time": 1392684.818305694,
+ "end_time": 1392900.218863064,
+ "total_evaluation_time_seconds": "215.40055736992508"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Allam-7b-instruct-preview/openaimmlu_0_shot.json b/evaluations/ar/Allam-7b-instruct-preview/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..30e95539e2473a32f1b19e223dc02454ffef260d
--- /dev/null
+++ b/evaluations/ar/Allam-7b-instruct-preview/openaimmlu_0_shot.json
@@ -0,0 +1,2707 @@
+{
+ "results": {
+ "openaimmlu": {
+ " ": " ",
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.4900662251655629,
+ "acc_stderr,none": 0.00883192107765626,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.25,
+ "acc_stderr,none": 0.04351941398892446
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.6842105263157895,
+ "acc_stderr,none": 0.037827289808654685
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.6597222222222222,
+ "acc_stderr,none": 0.039621355734862175
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.35,
+ "acc_stderr,none": 0.047937248544110196
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.44,
+ "acc_stderr,none": 0.04988876515698589
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.3,
+ "acc_stderr,none": 0.046056618647183814
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.37254901960784315,
+ "acc_stderr,none": 0.04810840148082633
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.71,
+ "acc_stderr,none": 0.045604802157206845
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.548936170212766,
+ "acc_stderr,none": 0.032529096196131965
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.3684210526315789,
+ "acc_stderr,none": 0.04537815354939391
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.5103448275862069,
+ "acc_stderr,none": 0.04165774775728763
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.48677248677248675,
+ "acc_stderr,none": 0.025742297289575142
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.6645161290322581,
+ "acc_stderr,none": 0.026860206444724352
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.4630541871921182,
+ "acc_stderr,none": 0.035083705204426656
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.56,
+ "acc_stderr,none": 0.04988876515698589
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.35185185185185186,
+ "acc_stderr,none": 0.02911661760608301
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.37748344370860926,
+ "acc_stderr,none": 0.039580272311215706
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.4675925925925926,
+ "acc_stderr,none": 0.03402801581358966
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.6834811529933481,
+ "acc_stderr,none": 0.01087157296938379,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.7333333333333333,
+ "acc_stderr,none": 0.03453131801885417
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.7254901960784313,
+ "acc_stderr,none": 0.03132179803083291
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.7721518987341772,
+ "acc_stderr,none": 0.027303484599069415
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.7355371900826446,
+ "acc_stderr,none": 0.04026187527591205
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.6851851851851852,
+ "acc_stderr,none": 0.04489931073591311
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.6871165644171779,
+ "acc_stderr,none": 0.03642914578292404
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.6077170418006431,
+ "acc_stderr,none": 0.027731258647011987
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.595679012345679,
+ "acc_stderr,none": 0.027306625297327698
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.7251461988304093,
+ "acc_stderr,none": 0.034240429246915824
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.5571476736345247,
+ "acc_stderr,none": 0.0062200183711956835,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.4740740740740741,
+ "acc_stderr,none": 0.04313531696750575
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.5773584905660377,
+ "acc_stderr,none": 0.030402331445769537
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.5086705202312138,
+ "acc_stderr,none": 0.0381189098894041
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.3888888888888889,
+ "acc_stderr,none": 0.04360314860077459
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.4,
+ "acc_stderr,none": 0.049236596391733084
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.7121212121212122,
+ "acc_stderr,none": 0.03225883512300992
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.7302752293577982,
+ "acc_stderr,none": 0.01902848671111545
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.6278026905829597,
+ "acc_stderr,none": 0.0324430528300873
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.41964285714285715,
+ "acc_stderr,none": 0.04684099321077106
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.66,
+ "acc_stderr,none": 0.04760952285695237
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.7573435504469987,
+ "acc_stderr,none": 0.015329888940899873
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.6601307189542484,
+ "acc_stderr,none": 0.027121956071388856
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.41843971631205673,
+ "acc_stderr,none": 0.029427994039419994
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.41264667535853977,
+ "acc_stderr,none": 0.012573836633799016
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.5735294117647058,
+ "acc_stderr,none": 0.030042615832714857
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.5522875816993464,
+ "acc_stderr,none": 0.020116925347422425
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.4759036144578313,
+ "acc_stderr,none": 0.03887971849597264
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.5578210590383444,
+ "acc_stderr,none": 0.008094265116110859,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.67,
+ "acc_stderr,none": 0.04725815626252609
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.772020725388601,
+ "acc_stderr,none": 0.03027690994517826
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.5692307692307692,
+ "acc_stderr,none": 0.025106820660539753
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.5756302521008403,
+ "acc_stderr,none": 0.03210479051015776
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.6641221374045801,
+ "acc_stderr,none": 0.04142313771996664
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.7281553398058253,
+ "acc_stderr,none": 0.044052680241409216
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.8076923076923077,
+ "acc_stderr,none": 0.025819233256483727
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.5751445086705202,
+ "acc_stderr,none": 0.026613350840261746
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.2916201117318436,
+ "acc_stderr,none": 0.015201032512520442
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.5727272727272728,
+ "acc_stderr,none": 0.047381987035454834
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.6693877551020408,
+ "acc_stderr,none": 0.030116426296540603
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.6915422885572139,
+ "acc_stderr,none": 0.032658195885126966
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.81,
+ "acc_stderr,none": 0.039427724440366234
+ }
+ },
+ "groups": {
+ "openaimmlu_STEM": {
+ "acc,none": 0.4900662251655629,
+ "acc_stderr,none": 0.00883192107765626,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.6834811529933481,
+ "acc_stderr,none": 0.01087157296938379,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.5571476736345247,
+ "acc_stderr,none": 0.0062200183711956835,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.5578210590383444,
+ "acc_stderr,none": 0.008094265116110859,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_jurisprudence",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_philosophy",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_prehistory",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_international_law",
+ "openaimmlu_world_religions"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_management",
+ "openaimmlu_security_studies",
+ "openaimmlu_business_ethics",
+ "openaimmlu_sociology",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_public_relations",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_marketing"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_nutrition",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_anatomy",
+ "openaimmlu_virology",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_human_aging",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_machine_learning",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_professional_law",
+ "openaimmlu_college_medicine",
+ "openaimmlu_formal_logic",
+ "openaimmlu_global_facts"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_college_physics",
+ "openaimmlu_astronomy",
+ "openaimmlu_computer_security",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_college_biology",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_econometrics",
+ "openaimmlu_high_school_biology"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735663577.7452598,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "openaimmlu_college_physics": "61aa73bd44d8ef4ba6cb739692c6eb8cabf49e8896a7f725463819ef0dbd0132",
+ "openaimmlu_astronomy": "4c75961650ca77c7fb21671a45e42b30b2d6737dd89a9dd0f98b5a102a6fd21c",
+ "openaimmlu_computer_security": "b3b48aa3be2791a75a9678e21c3f7000c0994934e0892e21be48b61eee9022b1",
+ "openaimmlu_elementary_mathematics": "7ea44fa3e77564b6d8928cb20f739393b20c6df004e164290d5d90ef3d0a8b79",
+ "openaimmlu_high_school_chemistry": "6d6a118469563d3ce711f5e5ca944e10ed6ea4e52c813217124fc043b7423db6",
+ "openaimmlu_college_mathematics": "005761dd03c5fd7ac82e8a629717d9fa02e750f7f4913433240ae0886e421bc6",
+ "openaimmlu_college_chemistry": "0539b9d96465df48c1370ac576a07d6d92f0829fe05cc79bd260ff763a74263d",
+ "openaimmlu_college_biology": "ac595a195f3fe505c334d8ed12697594cafcbaca1d3247eb8d70a3562f41443e",
+ "openaimmlu_conceptual_physics": "7e7cb338548eaf777c9eb7cca310fdf726660871c640842032735cc891816586",
+ "openaimmlu_high_school_statistics": "94f1ab74a8bdbd75041a62e1855c3d15b6ade91a7cc96d274bf57c420c5e7a91",
+ "openaimmlu_electrical_engineering": "599ce9a4e0938a3911259b3556952c70a5d9ac08df41625179cc73cb45a9797a",
+ "openaimmlu_high_school_computer_science": "f89803071a28b442ab2f85f9dc6f5c3acc87118b662626c1e164aae4304f45ea",
+ "openaimmlu_high_school_mathematics": "571309ff8e58fb1d5741d2d95ef005ff09d7e1fc61e75a59fd9bf77d1e4ec25e",
+ "openaimmlu_abstract_algebra": "18718e53c9eb375b294dc89fddd44e9fec66166611545df741957cb9c3056597",
+ "openaimmlu_high_school_physics": "b6a0c08c931f22af3809aba7b65315bf82834cf089961e2cada1bc6dab063306",
+ "openaimmlu_college_computer_science": "614eb77451f839f693631aef6269e65c82e88ad3aa7105f665f4e6187723f986",
+ "openaimmlu_econometrics": "5b060aa4148ab3c9e801d0591d391b90a21259c436d082120f19a16ce63f7c15",
+ "openaimmlu_high_school_biology": "fbd661d888bdfd56e1256684914f1b2d2e90f128b26503e0e0d10af6af678e01",
+ "openaimmlu_nutrition": "c44694a990c0a1187712f3e7d83ee10b5682fde624260c4b78bbd33641647f01",
+ "openaimmlu_miscellaneous": "a42b3d1263bfa01552f44579362d25e558662731a595c2dec558d8c9ca4d727b",
+ "openaimmlu_anatomy": "a2158278024b1e9f8867e30434721221f91cbcebaa34c3ca065cc039f6d9ce56",
+ "openaimmlu_virology": "3fd44c94e0170284d5232b194c5604d338d0ace9cd0ff686d2349b0e7c2e19d6",
+ "openaimmlu_professional_medicine": "c45c22a09efc77881a194f39b9622414eea01fdd59a4ce6fec12ca0bd542f73b",
+ "openaimmlu_human_aging": "4216c0274bb171bdb7c8cc2640cd812401e292271f4ee2b95b73b73a48b061c1",
+ "openaimmlu_clinical_knowledge": "3de23a26358560ceb58b2bd43bf1ca0873f1bde03f92f16048a7fe73ef086f3d",
+ "openaimmlu_professional_accounting": "b08f816170cc1f742b5d62eca448427a3d57369d56d0db3349d79f0d9da3aec3",
+ "openaimmlu_high_school_geography": "5f781f776d42c8b641139ad51da36b50da36c450bbec8a01fa33ae25a684133c",
+ "openaimmlu_professional_psychology": "ee9b7102b9f931cb3c7fab155f7b1828160df589ae8b16844039de7ae3c8d064",
+ "openaimmlu_high_school_psychology": "f62b828c33fd2ad378bb7aaf081c290210b378f9309aeef1fff01ddc83dd34c6",
+ "openaimmlu_machine_learning": "4f9232ed92776579ea24eb5fe1ef1275bb2bf8a290f5f004f749cb64b6feee40",
+ "openaimmlu_medical_genetics": "aece3b6adba4255559594b80438a0e80181e24366ff39b1c91456df945b01f4f",
+ "openaimmlu_professional_law": "a86f6179dbad6d30286d1b71dafd2799ae6fa219cd7b3a079bb067483bc64b2a",
+ "openaimmlu_college_medicine": "36232a0c51f93f761adf0529e7004f2f21479517d238b0c9bc538a1138e7482f",
+ "openaimmlu_formal_logic": "5d7a76ba9f40981d143f6fe780a8269bd525cc54aea3fac3dde297b4d4491413",
+ "openaimmlu_global_facts": "14d0542f6b985287da88bf4956de680e20754b9d74a009ca463beb66ae081e92",
+ "openaimmlu_high_school_microeconomics": "51a598169dbd7a3de2e64558fd952a17cd39b49f0c6cb4de95fdb0e5520245c7",
+ "openaimmlu_high_school_government_and_politics": "7c79983103e230916bf4d730743e92feb7f17e308a2897de764d24f49de93f37",
+ "openaimmlu_management": "b85920acbd491f4bafeda9de9dc3af4408e63e0d53bfbeade834107dff6c3e3b",
+ "openaimmlu_security_studies": "de9eccb24ee7d56897728b9bd30c0159f42a6cac86f3d2090504439efcbb2348",
+ "openaimmlu_business_ethics": "5536730d841c70c256991081bd4d8a9c28aaebcab9ac3cdd36ceb1aad896cfae",
+ "openaimmlu_sociology": "1fa4ebb60178be200e3b8167e10fec0843964725a6be034e6893ae42d5dd1a3a",
+ "openaimmlu_high_school_macroeconomics": "938c0435e322f454a5b1f26f1b53870141a9e311bbc95512cf307ecd007e66db",
+ "openaimmlu_moral_scenarios": "c0158287c824e917d6d76a9d4a4e68a53af20dee7bb7c1d372a759546edef562",
+ "openaimmlu_public_relations": "545ea7d7b3dbbe04d2c367cd70142c35e1ce585a3c3e4b9d9fa2290d1d25272f",
+ "openaimmlu_us_foreign_policy": "cddc366ef735093ff1ab6d3660a19d52ae146b9ca18668d8a878be81466cb626",
+ "openaimmlu_moral_disputes": "2b9fd83448202cb343ad8473f9d34194776e73f9fac0fad093610033039e0152",
+ "openaimmlu_human_sexuality": "cd4281e8629dd63b57e11fff680c2813bfc156d0807d9a3424670422bb8a8f02",
+ "openaimmlu_marketing": "85aaada41a32346c0dce6f252b7e5e50a1bce1641cab3ecf6e1590deb8927db4",
+ "openaimmlu_jurisprudence": "ab9de498411479a47a892895a70b20948854fe8c8177f9851da339a984c534f0",
+ "openaimmlu_logical_fallacies": "1c809bb030ca1d7256a741cd2f7b3719053d6387df5f89762fca7aa430374461",
+ "openaimmlu_philosophy": "e6367d4cab84d33e5ba62f20bc52f72d630c4324c0d34628b546cf72a83eb94f",
+ "openaimmlu_high_school_world_history": "f7e18a11fcc4e11b8c758d3227d7e7fc59157c9bd465ade0c8e4707cb3c76c2a",
+ "openaimmlu_high_school_european_history": "f6ee95e6dc273ad3d18c110a69772161a9eb250ef81c3202a46228689c5e2071",
+ "openaimmlu_prehistory": "88d6574515b52d900aab10f1f22d026fa33c8d910a6528acf3dac384d1e82b0a",
+ "openaimmlu_high_school_us_history": "541952c75bfb8c256d813fdfc4d7707ad25448980fd70d39142ec03a15af9d0d",
+ "openaimmlu_international_law": "c1acfc8203c4d2f4d5d9245685804c2b7406601dfd8106cc4fac985915559f52",
+ "openaimmlu_world_religions": "9b68b777a6bc2b05efee57f75e87792c6b14f39464621c16e4c24b024aeb2630"
+ },
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4238.00553465,
+ "end_time": 4722.113520369,
+ "total_evaluation_time_seconds": "484.10798571899977"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Falcon3-7B-Instruct/acva_5_shot.json b/evaluations/ar/Falcon3-7B-Instruct/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..cb0bd0096c008908dc64a2311ffe2a92fe6c545a
--- /dev/null
+++ b/evaluations/ar/Falcon3-7B-Instruct/acva_5_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.6045924225028703,
+ "acc_stderr,none": 0.00523925695392083,
+ "acc_norm,none": 0.5897818599311137,
+ "acc_norm_stderr,none": 0.005270708411925859
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736889821.9957027,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "acva": "f573ae5740e68711d257f2dc4a23db7c6b1c04895364f1af4b4eb64bfab793a4"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 600072.370318618,
+ "end_time": 600217.222010416,
+ "total_evaluation_time_seconds": "144.85169179795776"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Falcon3-7B-Instruct/ar_ifeval_0_shot.json b/evaluations/ar/Falcon3-7B-Instruct/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..537637449aa19d1166d33ca6db66eedc3df36ac7
--- /dev/null
+++ b/evaluations/ar/Falcon3-7B-Instruct/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.08582089552238806,
+ "prompt_level_strict_acc_stderr,none": 0.012109752724743699,
+ "inst_level_strict_acc,none": 0.47918088737201364,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.13805970149253732,
+ "prompt_level_loose_acc_stderr,none": 0.014914035308708435,
+ "inst_level_loose_acc,none": 0.5276450511945392,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739621196.897086,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "ar_ifeval": "ca837eed1e9f468712643d1fab81b7b48c88a8799239851476bdc889990e6b41"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside tags.\\n\\n' + tools|tojson(indent=2) + '\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n\\n' + message['content'] + '\\n\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
+ "start_time": 1395880.012817552,
+ "end_time": 1401371.318791154,
+ "total_evaluation_time_seconds": "5491.305973601993"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Falcon3-7B-Instruct/araMath_v3_5_shot.json b/evaluations/ar/Falcon3-7B-Instruct/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..5a089641faf8de72b9fe597f7e7213f1b4fe5b50
--- /dev/null
+++ b/evaluations/ar/Falcon3-7B-Instruct/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.5652892561983471,
+ "acc_stderr,none": 0.020170519477736983,
+ "acc_norm,none": 0.5652892561983471,
+ "acc_norm_stderr,none": 0.020170519477736983
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739621084.921236,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "araMath_v3": "b7e29b20c532c7420cc659c6586d56642070560abff0925ed01ad8f200d8e72b"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside tags.\\n\\n' + tools|tojson(indent=2) + '\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n\\n' + message['content'] + '\\n\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
+ "start_time": 1395768.116667791,
+ "end_time": 1395816.745740765,
+ "total_evaluation_time_seconds": "48.629072973970324"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Falcon3-7B-Instruct/araPro_0_shot.json b/evaluations/ar/Falcon3-7B-Instruct/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ab8446028b82f261ab5b2774b9617e67ce808861
--- /dev/null
+++ b/evaluations/ar/Falcon3-7B-Instruct/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.41471705658868224,
+ "acc_stderr,none": 0.006967450316480296,
+ "acc_norm,none": 0.41471705658868224,
+ "acc_norm_stderr,none": 0.006967450316480296
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617143.3614087,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "araPro": "063166ad2e52146b6a051c978bf54b1397281e222da633e81fa50357d2409ee9"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside tags.\\n\\n' + tools|tojson(indent=2) + '\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n\\n' + message['content'] + '\\n\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
+ "start_time": 1391826.416201954,
+ "end_time": 1394850.089034202,
+ "total_evaluation_time_seconds": "3023.672832248034"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Falcon3-7B-Instruct/arabicmmlu_0_shot.json b/evaluations/ar/Falcon3-7B-Instruct/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..2c7362013abb2d94557592c3eef42693b03d6881
--- /dev/null
+++ b/evaluations/ar/Falcon3-7B-Instruct/arabicmmlu_0_shot.json
@@ -0,0 +1,2090 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.4208232445520581,
+ "acc_stderr,none": 0.004040113223189638,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.44239250275633957,
+ "acc_stderr,none": 0.008046896182334524,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.3144736842105263,
+ "acc_stderr,none": 0.016853237146172328
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.4221556886227545,
+ "acc_stderr,none": 0.02706572265618471
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.5128205128205128,
+ "acc_stderr,none": 0.08108404256842
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.3489827856025039,
+ "acc_stderr,none": 0.01887069517251757
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.42857142857142855,
+ "acc_stderr,none": 0.03481904844438804
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.49159663865546216,
+ "acc_stderr,none": 0.03247390276569669
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.37254901960784315,
+ "acc_stderr,none": 0.04810840148082635
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.6016016016016016,
+ "acc_stderr,none": 0.01549701356425835
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.4426751592356688,
+ "acc_stderr,none": 0.028075313057827626
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.4161603888213852,
+ "acc_stderr,none": 0.011940274964070782,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.5098039215686274,
+ "acc_stderr,none": 0.0202239460050743
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.3643835616438356,
+ "acc_stderr,none": 0.02522471433569769
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.27692307692307694,
+ "acc_stderr,none": 0.022688042352424994
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.4444444444444444,
+ "acc_stderr,none": 0.09745089103411436
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.47619047619047616,
+ "acc_stderr,none": 0.031523917851640645
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.47020933977455714,
+ "acc_stderr,none": 0.009934531753088865,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.5260115606936416,
+ "acc_stderr,none": 0.014354525266560796
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.3854166666666667,
+ "acc_stderr,none": 0.016567242795987865
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.45348837209302323,
+ "acc_stderr,none": 0.03807016210250966
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.4691358024691358,
+ "acc_stderr,none": 0.03933037336475501
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.5866666666666667,
+ "acc_stderr,none": 0.05724401171194134
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.3818493150684932,
+ "acc_stderr,none": 0.00812527639293321,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.28735632183908044,
+ "acc_stderr,none": 0.048797477314965754
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.425,
+ "acc_stderr,none": 0.026090425569673732
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.30346820809248554,
+ "acc_stderr,none": 0.014277024139952538
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.3686440677966102,
+ "acc_stderr,none": 0.031470730682346106
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.4827586206896552,
+ "acc_stderr,none": 0.05388432214060092
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.3639705882352941,
+ "acc_stderr,none": 0.029227192460032025
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.33195020746887965,
+ "acc_stderr,none": 0.03039731808552683
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.43859649122807015,
+ "acc_stderr,none": 0.0663095566682855
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.4978723404255319,
+ "acc_stderr,none": 0.01884428842004545
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.4189189189189189,
+ "acc_stderr,none": 0.05774600244608328
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.38686131386861317,
+ "acc_stderr,none": 0.041762602685795874
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.36666666666666664,
+ "acc_stderr,none": 0.03333333333333339
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.4030692139054181,
+ "acc_stderr,none": 0.008590519358095423,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.34776437189496096,
+ "acc_stderr,none": 0.012692391957016312
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.4099616858237548,
+ "acc_stderr,none": 0.030501771826233565
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.30196078431372547,
+ "acc_stderr,none": 0.02880701939354399
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.5185185185185185,
+ "acc_stderr,none": 0.09799078929868854
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.4256198347107438,
+ "acc_stderr,none": 0.03184946380154992
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.46842105263157896,
+ "acc_stderr,none": 0.03629703808831611
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.5476772616136919,
+ "acc_stderr,none": 0.024640895323937397
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.46130952380952384,
+ "acc_stderr,none": 0.02723600815931351
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.4375,
+ "acc_stderr,none": 0.0625
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.4208232445520581,
+ "acc_stderr,none": 0.004040113223189638,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.44239250275633957,
+ "acc_stderr,none": 0.008046896182334524,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.4161603888213852,
+ "acc_stderr,none": 0.011940274964070782,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.47020933977455714,
+ "acc_stderr,none": 0.009934531753088865,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.3818493150684932,
+ "acc_stderr,none": 0.00812527639293321,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.4030692139054181,
+ "acc_stderr,none": 0.008590519358095423,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_arabic_language_(general)",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_high_arabic_language"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_primary_math"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_high_history",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_middle_history"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_primary_social_science"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_univ_management",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_general_knowledge"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736889500.3930833,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "arabicmmlu_primary_general_knowledge": "91aa1e48a6f5ccff48fa6fa3277bbc97d23e6416fde69528f8956d0e90bc6244",
+ "arabicmmlu_driving_test": "69f79faf8c303370c2df3ec536dd4c3cad19cf2cda6a1e77cff4852c0ebb14ee",
+ "arabicmmlu_univ_management": "2ecfab399c12f6df05e9fd3a1db2573e7c48f5fa49566ce280a668a29896c4e3",
+ "arabicmmlu_middle_general_knowledge": "e6929eb4f7ad78ad5b6b1141e390ce2c789a3ae9d3cf0ffeccac415a4212dcde",
+ "arabicmmlu_general_knowledge": "1dfc3c92b60733bdc6f46f0f8268ac5feff7a327546595fff67ca2a4fa76ed4c",
+ "arabicmmlu_middle_economics": "5583d598d2fe7413e6314b657d446ca334756888066f9fe8c2194c3a06458553",
+ "arabicmmlu_univ_accounting": "9870f7d5ef58b1a884b890d26139fc3d9e3988082909e5b342eb220c40c74994",
+ "arabicmmlu_high_geography": "d1b6c33005a743500852a60611f03f8827f80ab343290f5b1e5a3b1d4293e77f",
+ "arabicmmlu_univ_political_science": "f27497dde305b538488920449e92ab0d4bfe35e4189e4212ad92e2fb76148e97",
+ "arabicmmlu_middle_social_science": "1d25ab6b44a1b26de084ab850a913531b607a6e2803d8a79ce6863c832c22a95",
+ "arabicmmlu_univ_economics": "f6cf4679eeae2e6eeb62050825cf38761c81d23b449aadd64e6adab85bbed352",
+ "arabicmmlu_primary_geography": "e35f11ec9fef451aba1e9477b5d9486442f90bc2ea2e5e308a41c55dbce411fd",
+ "arabicmmlu_middle_geography": "ed8b5cb8778ba57b3bfde2668f5c9bb71bff970583f294f428883c68bb9ae454",
+ "arabicmmlu_high_economics": "a34142d316652408881b759c7330f4f661a7346e6771f5f22e85d19db23d7bfb",
+ "arabicmmlu_high_civics": "a6da6b37a218224abfbdf5816c27d5c52546e3d4dbd6f7eed7a4979516c21acd",
+ "arabicmmlu_middle_civics": "812fa9145e919b429ec0bda856075de404bf052193261c9dd9e07f80258b9b76",
+ "arabicmmlu_primary_social_science": "58e86199fce5371c2af5e271fcf9beea7fb9947c6d72f921bc13d3caef2e7ec4",
+ "arabicmmlu_islamic_studies": "6c27b44beb48e9774cb7d01f7b365291fe562fb35c8f2e0872f119b67f778c1c",
+ "arabicmmlu_primary_history": "5f53990b8b0f0fe784c8297459f6591dbe8cbe04ce72de398525009c23591197",
+ "arabicmmlu_high_history": "57c73385cc86d08a8d9da669118dd92f96b286567635901b01da5d79c898a4aa",
+ "arabicmmlu_primary_islamic_studies": "f3d423a3b1b1b5b1128ea8428035df4b08c998c5450f38eb80cae4b79874fa2a",
+ "arabicmmlu_prof_law": "731e89e57ca52310b4b446fea6ed38cebee09362abf58651c81281646b692f23",
+ "arabicmmlu_high_islamic_studies": "a165da3444067e26499b01625e631eba032e28cc1fc6b6aa9030b53671452436",
+ "arabicmmlu_middle_islamic_studies": "75c973380c08f25822af4104db06901b5c6a0cdb1a628d2bd90bcf8526a1ef5e",
+ "arabicmmlu_high_philosophy": "8cbea21a7922a09751bd6d1eca16be8570a3544536dbe12de03731e194ad50c2",
+ "arabicmmlu_middle_history": "1e1f90835724b6b5ce6297d91a656ac226e8210bdba020e4b9e3b6817a6414a9",
+ "arabicmmlu_middle_computer_science": "249aebce1d740b259a5479569a981b9a343cd9fa8e309cfb0bcd53253c3a7a2e",
+ "arabicmmlu_primary_computer_science": "27439beeb6cb7c0cc4cee804b7d1f9e7251a94c644aae58fd3206d35e2aa93d2",
+ "arabicmmlu_high_computer_science": "a8c4e2ea301b4a23c47173ec3273d443028e21dc6fdb9d42e1b675220b4689ed",
+ "arabicmmlu_primary_natural_science": "36e1bf3486dee3ffd262d051d99429869f8627e05fa7798d1d3f586992796fe7",
+ "arabicmmlu_middle_natural_science": "a6a93f918a781ecab3b7ab692d645b199292403794c9bf2883ecb1ecace32e0b",
+ "arabicmmlu_univ_computer_science": "5b38c4b463a4be775770bb51f341e7744c7e154afb4802e09cf199951621be99",
+ "arabicmmlu_high_physics": "605d008475723d413ddeee9ab64db12fa85fa6ac0d0f029891694dfe5f7d3911",
+ "arabicmmlu_high_biology": "04bfaf2a7d77c83199c24e03ab8f94c5e18b5006bc042901c3b6be100621a6db",
+ "arabicmmlu_primary_math": "6169402e232f04147465bba4bc8be27e400675676c6d6c1951aabfadf2077e99",
+ "arabicmmlu_arabic_language_(grammar)": "179324a1e8e0ccf6413411a6541c88130d78d31f0fec7274f9bfc19484f77a85",
+ "arabicmmlu_middle_arabic_language": "7f510542f64580f95a35baf4533ed39fc59f6efe2a89af570675e4d9e30cf7f9",
+ "arabicmmlu_arabic_language_(general)": "c342dca15e7dcbbe9b320f3726484abbec23656545fa3195a0014ad5d385e75c",
+ "arabicmmlu_primary_arabic_language": "11ae5685e1cc66af215f4b43d45b2fcf6376e9389390c7e3aed3414122a935a1",
+ "arabicmmlu_high_arabic_language": "2e5e0e90e40a42af3b2d5556d603782a252cc1350e65cf2654aaaa95e3e0cd06"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 599750.782067174,
+ "end_time": 599905.082863244,
+ "total_evaluation_time_seconds": "154.30079607001971"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Falcon3-7B-Instruct/etec_v2_0_shot.json b/evaluations/ar/Falcon3-7B-Instruct/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..d41da376e179a2e392bd8a751492d6b3dca32cbf
--- /dev/null
+++ b/evaluations/ar/Falcon3-7B-Instruct/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.3751987281399046,
+ "acc_stderr,none": 0.01114886834610489,
+ "acc_norm,none": 0.3751987281399046,
+ "acc_norm_stderr,none": 0.01114886834610489
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739620236.678696,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "etec_v2": "3a8dc6484af6c9538f122c1bbe5c6866dbe14df841fdf04ab7ff2b6437e8aeae"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside tags.\\n\\n' + tools|tojson(indent=2) + '\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n\\n' + message['content'] + '\\n\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
+ "start_time": 1394919.684315533,
+ "end_time": 1394995.42617788,
+ "total_evaluation_time_seconds": "75.7418623471167"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Falcon3-7B-Instruct/exams_ar_5_shot.json b/evaluations/ar/Falcon3-7B-Instruct/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..470702d0cd8b06409c52ec6de37997139d9ef69f
--- /dev/null
+++ b/evaluations/ar/Falcon3-7B-Instruct/exams_ar_5_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.31843575418994413,
+ "acc_stderr,none": 0.020122499132803468,
+ "acc_norm,none": 0.31843575418994413,
+ "acc_norm_stderr,none": 0.020122499132803468
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 0.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736889028.6416683,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "exams_ar": "f52ab3f14b240558420910fdb453ccb45c945cec187c0e60ea51cf6eff08973a"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 599279.04705073,
+ "end_time": 599692.233103212,
+ "total_evaluation_time_seconds": "413.1860524819931"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Falcon3-7B-Instruct/gat_0_shot.json b/evaluations/ar/Falcon3-7B-Instruct/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..58edb7a4b54ae084b182cd55496993ea2786f2e7
--- /dev/null
+++ b/evaluations/ar/Falcon3-7B-Instruct/gat_0_shot.json
@@ -0,0 +1,553 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.27994481374639407,
+ "acc_stderr,none": 0.003542796359675536,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.2571428571428571,
+ "acc_stderr,none": 0.008420562208967575
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.24553734061930782,
+ "acc_stderr,none": 0.008216476082874105
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.26573426573426573,
+ "acc_stderr,none": 0.008475894211016492
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.24019138755980862,
+ "acc_stderr,none": 0.013221495215360054
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.319672131147541,
+ "acc_stderr,none": 0.013357022766710734
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.27520661157024795,
+ "acc_stderr,none": 0.012844683062506254
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.26993865030674846,
+ "acc_stderr,none": 0.01229815625441917
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.2876712328767123,
+ "acc_stderr,none": 0.023726723391354485
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.3568998109640832,
+ "acc_stderr,none": 0.009317121354774414
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.27994481374639407,
+ "acc_stderr,none": 0.003542796359675536,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736891004.0192773,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "gat_analogy": "04ac010c48ed039457058b512b7ac0586c7c76a628da7caaf9aeb8f3e99ae5e3",
+ "gat_association": "2cbd868d220125bfcc54ae738592ad902191e4b7f804ce1772ae29e2d3bb3bf6",
+ "gat_completion": "74cf159ef4a3455a6a0e984fed8e9e9a12f0dc21fde95c2058216c5a711a4d31",
+ "gat_reading": "6f21934e536e7dca65361d01e5cafc27f8070c4f0dccf5a88c1fe071194b78a4",
+ "gat_algebra": "20750c926608570eaf87d29981e5ab49b2b097bd52d7f749c44ab4e175d9fdd2",
+ "gat_arithmetic": "c4b0c73c269d9eb3e8482fbda42e69191c28b95e75e1517d5f9142c6ef410204",
+ "gat_comparisons": "88bc22db186a50cab28938ec1fc332366fa0bc886bc98edf810cc9ae938405db",
+ "gat_contextual": "b8e88ff29b62b54eb834dca696304ca0fe1ce55d5cf7d0a9f0204456e3955be6",
+ "gat_geometry": "229545188469d0512a3297737f4ec7afe88d8a30e7e04f87b4982548e83b1e56"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 601254.206185867,
+ "end_time": 601373.470204397,
+ "total_evaluation_time_seconds": "119.26401853002608"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Falcon3-7B-Instruct/moe_ien_mcq_0_shot.json b/evaluations/ar/Falcon3-7B-Instruct/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..99c53fd410b3fe4d937b93f9b9171544e031c48e
--- /dev/null
+++ b/evaluations/ar/Falcon3-7B-Instruct/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.5265265265265265,
+ "acc_stderr,none": 0.004995706870392996,
+ "acc_norm,none": 0.5265265265265265,
+ "acc_norm_stderr,none": 0.004995706870392996
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739620378.768502,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_mcq": "1ae93edb904d572143b5f36dd5dfcc4b901240916d4735ea328083598c912446"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside tags.\\n\\n' + tools|tojson(indent=2) + '\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n\\n' + message['content'] + '\\n\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
+ "start_time": 1395061.894176973,
+ "end_time": 1395336.684131379,
+ "total_evaluation_time_seconds": "274.78995440597646"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Falcon3-7B-Instruct/moe_ien_tf_0_shot.json b/evaluations/ar/Falcon3-7B-Instruct/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..37f8e4a3b0738efc596f5f6f6c9ab4fe2e31080c
--- /dev/null
+++ b/evaluations/ar/Falcon3-7B-Instruct/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.576335222393955,
+ "acc_stderr,none": 0.006476086786980228,
+ "acc_norm,none": 0.576335222393955,
+ "acc_norm_stderr,none": 0.006476086786980228
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739620722.9521024,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_tf": "ed81617ccb178d095c9a81fef15f5ba8b655782b26d36117f53c38b0a84e62e5"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside tags.\\n\\n' + tools|tojson(indent=2) + '\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n\\n' + message['content'] + '\\n\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
+ "start_time": 1395406.00589162,
+ "end_time": 1395704.54657667,
+ "total_evaluation_time_seconds": "298.54068504995666"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Falcon3-7B-Instruct/openaimmlu_0_shot.json b/evaluations/ar/Falcon3-7B-Instruct/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b60c0a9c0eb1dfacc88e67bbf25e9e6e02ffc1c5
--- /dev/null
+++ b/evaluations/ar/Falcon3-7B-Instruct/openaimmlu_0_shot.json
@@ -0,0 +1,2711 @@
+{
+ "results": {
+ "openaimmlu": {
+ " ": " ",
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.32847682119205296,
+ "acc_stderr,none": 0.008517820734335659,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.34,
+ "acc_stderr,none": 0.04760952285695235
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.35526315789473684,
+ "acc_stderr,none": 0.038947344870133176
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.2708333333333333,
+ "acc_stderr,none": 0.03716177437566016
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.29,
+ "acc_stderr,none": 0.045604802157206845
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.38,
+ "acc_stderr,none": 0.04878317312145634
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.28,
+ "acc_stderr,none": 0.045126085985421296
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.23529411764705882,
+ "acc_stderr,none": 0.04220773659171453
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.32,
+ "acc_stderr,none": 0.046882617226215034
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.30638297872340425,
+ "acc_stderr,none": 0.030135906478517563
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.30701754385964913,
+ "acc_stderr,none": 0.04339138322579861
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.38620689655172413,
+ "acc_stderr,none": 0.04057324734419034
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.40476190476190477,
+ "acc_stderr,none": 0.025279850397404904
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.3161290322580645,
+ "acc_stderr,none": 0.026450874489042767
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.3399014778325123,
+ "acc_stderr,none": 0.033327690684107895
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.44,
+ "acc_stderr,none": 0.04988876515698589
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.34444444444444444,
+ "acc_stderr,none": 0.028972648884844267
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.23841059602649006,
+ "acc_stderr,none": 0.03479185572599657
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.26851851851851855,
+ "acc_stderr,none": 0.030225226160012417
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.3464523281596452,
+ "acc_stderr,none": 0.011178696015775447,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.3939393939393939,
+ "acc_stderr,none": 0.0381549430868893
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.3235294117647059,
+ "acc_stderr,none": 0.03283472056108566
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.3459915611814346,
+ "acc_stderr,none": 0.03096481058878671
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.4628099173553719,
+ "acc_stderr,none": 0.04551711196104218
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.4166666666666667,
+ "acc_stderr,none": 0.04766075165356461
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.3374233128834356,
+ "acc_stderr,none": 0.03714908409935573
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.3408360128617363,
+ "acc_stderr,none": 0.02692084126077616
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.31790123456790126,
+ "acc_stderr,none": 0.025910063528240868
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.27485380116959063,
+ "acc_stderr,none": 0.03424042924691583
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.3083277140930546,
+ "acc_stderr,none": 0.0059796238033850944,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.3037037037037037,
+ "acc_stderr,none": 0.03972552884785137
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.30566037735849055,
+ "acc_stderr,none": 0.028353298073322666
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.2832369942196532,
+ "acc_stderr,none": 0.03435568056047874
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.3412698412698413,
+ "acc_stderr,none": 0.042407993275749234
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.34,
+ "acc_stderr,none": 0.04760952285695235
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.3181818181818182,
+ "acc_stderr,none": 0.03318477333845332
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.28807339449541286,
+ "acc_stderr,none": 0.01941644589263603
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.3273542600896861,
+ "acc_stderr,none": 0.031493846709941306
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.23214285714285715,
+ "acc_stderr,none": 0.04007341809755806
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.41,
+ "acc_stderr,none": 0.04943110704237102
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.34738186462324394,
+ "acc_stderr,none": 0.01702667174865574
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.4084967320261438,
+ "acc_stderr,none": 0.028146405993096358
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.25886524822695034,
+ "acc_stderr,none": 0.02612957252718085
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.30182529335071706,
+ "acc_stderr,none": 0.011724350518105888
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.22058823529411764,
+ "acc_stderr,none": 0.02518778666022727
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.2761437908496732,
+ "acc_stderr,none": 0.018087276935663137
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.35542168674698793,
+ "acc_stderr,none": 0.03726214354322415
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.33414485696895924,
+ "acc_stderr,none": 0.008161503557308653,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.37,
+ "acc_stderr,none": 0.04852365870939099
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.26424870466321243,
+ "acc_stderr,none": 0.03182155050916648
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.31794871794871793,
+ "acc_stderr,none": 0.023610884308927865
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.3277310924369748,
+ "acc_stderr,none": 0.030489911417673227
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.4198473282442748,
+ "acc_stderr,none": 0.04328577215262972
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.3106796116504854,
+ "acc_stderr,none": 0.04582124160161551
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.4230769230769231,
+ "acc_stderr,none": 0.032366121762202014
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.31213872832369943,
+ "acc_stderr,none": 0.024946792225272307
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.2681564245810056,
+ "acc_stderr,none": 0.014816119635317008
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.35454545454545455,
+ "acc_stderr,none": 0.04582004841505417
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.4,
+ "acc_stderr,none": 0.03136250240935893
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.4129353233830846,
+ "acc_stderr,none": 0.03481520803367348
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.54,
+ "acc_stderr,none": 0.05009082659620333
+ }
+ },
+ "groups": {
+ "openaimmlu_STEM": {
+ "acc,none": 0.32847682119205296,
+ "acc_stderr,none": 0.008517820734335659,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.3464523281596452,
+ "acc_stderr,none": 0.011178696015775447,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.3083277140930546,
+ "acc_stderr,none": 0.0059796238033850944,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.33414485696895924,
+ "acc_stderr,none": 0.008161503557308653,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_international_law",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_prehistory",
+ "openaimmlu_world_religions",
+ "openaimmlu_philosophy",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_high_school_us_history"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_management",
+ "openaimmlu_business_ethics",
+ "openaimmlu_security_studies",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_marketing",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_public_relations",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_sociology"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_miscellaneous",
+ "openaimmlu_professional_law",
+ "openaimmlu_machine_learning",
+ "openaimmlu_global_facts",
+ "openaimmlu_anatomy",
+ "openaimmlu_college_medicine",
+ "openaimmlu_human_aging",
+ "openaimmlu_formal_logic",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_virology",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_nutrition",
+ "openaimmlu_high_school_geography"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_college_physics",
+ "openaimmlu_computer_security",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_econometrics",
+ "openaimmlu_college_biology",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_astronomy",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_high_school_biology"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736890748.3267176,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "openaimmlu_high_school_mathematics": "df02371376ec95c9406e2ff6a36bf2a1ea28d1928668f0b3847898078241cd38",
+ "openaimmlu_college_physics": "35036c7ee551f577e536b265e4c19d6113e6100623a7e81e090dce664abda802",
+ "openaimmlu_computer_security": "160f20db5ddd067bb31a2fc75d678e5e292d74015bb7030b9aa0ea2eb850530b",
+ "openaimmlu_college_computer_science": "c40b1f441d5371cb93738d6ab836aecd34aaea10ac6cbae949c195b033054bfe",
+ "openaimmlu_abstract_algebra": "06a0f6ff7a57c59fb93b15c6ecb5f30709bc2156d0083e8c3a244e40e7f68a91",
+ "openaimmlu_high_school_statistics": "d1413e0f4bfdffb9e7b5926df92fd704175659a6c09f7a6269389ba41485c049",
+ "openaimmlu_college_mathematics": "5ed7c30b039bac914edd2cf744db5c5ff81cf29ff84181f69ea7bc1ee92d59dc",
+ "openaimmlu_college_chemistry": "07f8c55dbe5a1b2b827229d50416d4a998d08a0ffaddc6c42f47fab58de668ac",
+ "openaimmlu_high_school_computer_science": "d918e63bf3588fe06defe04a12d3e016bfdaad25ffe9fe242ee46b263f46f9b2",
+ "openaimmlu_elementary_mathematics": "96ccc5e84dc3ec5cc997298683bb38aeae06c965c11866382ed573cf79958544",
+ "openaimmlu_high_school_physics": "922db7807587177d039bb3bbc3f986ea29ff96b64b45816cda8a649950ded2f2",
+ "openaimmlu_conceptual_physics": "84cfafbb3a9c37c0067098210a14d8297c3d4477276b594a4f7fa40e5a4c43a2",
+ "openaimmlu_econometrics": "b89ef7b8e3fab62d8568d7a74893cf7b69997f3069aa681a263bffbc24ac091f",
+ "openaimmlu_college_biology": "e84687bbe74d124d198f791d2108a4caf5798f3bb803529aed5bae8939345e09",
+ "openaimmlu_electrical_engineering": "d613f287d6cb53521193eaeafef9b1e3bc4d23fec81af05f46df581d85e28930",
+ "openaimmlu_astronomy": "592a0cb02318597a452e074f3d04930eb7b9a12cb492f9ab16aa825bd2d44c1e",
+ "openaimmlu_high_school_chemistry": "04bb34f39ab15ad295823eab6765eba7829393b8aeffb610107a3b52aa75789c",
+ "openaimmlu_high_school_biology": "8f6f966ffa66e26cdd3184854b329477f532e2bc0d3124cdd522f0a4372d524c",
+ "openaimmlu_miscellaneous": "75f78a28f6382ee54628eed866c1f5cad54c9004544d1e0e50bfd43de86dec73",
+ "openaimmlu_professional_law": "56fecd11891fc1a1618e2f4cba7f74b01ce5c78b33a6bbd05f56509225476cd3",
+ "openaimmlu_machine_learning": "4c7c4b0fe2e7df74dfb11cfd51543ec0c2aa90c8e3c4b357efa131ed27e1d802",
+ "openaimmlu_global_facts": "990031c4f7667bbe547f384284195656902d499683bae0de3bed83401d012307",
+ "openaimmlu_anatomy": "b8b9a59680da920ace7de601112bae3ddfeb9373823206cfd3386a3a116bf5ba",
+ "openaimmlu_college_medicine": "895e37b75ab276e4af42f382cf38ebff5905ddba3257b4584646e774d669966d",
+ "openaimmlu_human_aging": "c7dd36b2b571ec34405e12c7b30fbce83ce41cf80c9082a7070d871f1d567a4f",
+ "openaimmlu_formal_logic": "37cadc6de2615a1fb88b3f2b3ceafe577573c3c8c664e1a88115141ec1814cef",
+ "openaimmlu_professional_accounting": "1b66076938a73ae289e8c741b5e9c5273b8e6335e4a03e6f8557de75777cb9e9",
+ "openaimmlu_high_school_psychology": "89965b10f9d92f9da802d530f76075b98c60656759c1b94933e30d1d5a97e275",
+ "openaimmlu_clinical_knowledge": "478cd76f6ae8f18be117885de984c5c24d2afa6b0a03569f6cdee8ce54a4723c",
+ "openaimmlu_professional_psychology": "6ba2fa8d5e37978e940143629fd6c77c948c1ff73d0b6b27cd689a3a73b014d6",
+ "openaimmlu_medical_genetics": "7ffc6e6eb0ecade753d830cea0ce3b782234414a87013916c52d098a10036b1c",
+ "openaimmlu_virology": "f4fffd1b1e41a95fe22eab417715b1f89db7dcaf396450bb54217becba6bea48",
+ "openaimmlu_professional_medicine": "54806e63b4341eae4298c537e02802bebd754820aadefac78c322ac671e91b75",
+ "openaimmlu_nutrition": "b8df6884453ddb38c0801500887a1e8389b44ded3851ea8134112a7ab6e6a9dd",
+ "openaimmlu_high_school_geography": "ee58d4e56a8653d4eb3f9be1c1dfbb1caf93809ffd956f77fbfa2e9fc8e027e9",
+ "openaimmlu_management": "9a4a7c9a8ec87b9cb943c4673386e6317eacd582c93951950ef0ff50c474ee38",
+ "openaimmlu_business_ethics": "aa2fe1f91fabbe1315984d1d654347d7e1a682bb4298ff7e1859a6a968bf1246",
+ "openaimmlu_security_studies": "78d64883d2d76efc502ef6e9489a7a19ed54ee4a6cb07b442db97a749ddcd0f9",
+ "openaimmlu_moral_scenarios": "969d4d5af11704d0747cf141292fac7ed5f12fc2a3ed393bc7bce8fa2b89665c",
+ "openaimmlu_marketing": "7ec3711e36110aace63e7bf63697943476567d2a99bd85a46a7050a72cc7dd3d",
+ "openaimmlu_high_school_government_and_politics": "f68538a9c88b33bc7ca2fc71b48be912c05d21068090825f601376d754fdbd0e",
+ "openaimmlu_public_relations": "b876ceb0f0f6c77cc34d430093471115271df1b78d9630fc9c56c50cd905ab6f",
+ "openaimmlu_high_school_microeconomics": "608604afeb38c3f75321e6387647d8f9ff7114648d1063bb03821ec734a09205",
+ "openaimmlu_us_foreign_policy": "ca19e9ea92549f964755bf00cb8b78af81dbb36ae21a866a3729c8d2c7dc8fe8",
+ "openaimmlu_high_school_macroeconomics": "8521c25e6160374c93bab3ec5f0f2c8379a9baa5d1b9bbf4833f2fc7447721d8",
+ "openaimmlu_moral_disputes": "971dda048fe6a7964cb6426e830e6fc2c434b913b80313d809d8ee5bbe9ba8dd",
+ "openaimmlu_human_sexuality": "b184a75c6c862f5e3954c7933db056f81cbafdd28c84f9106801dcda047eb62a",
+ "openaimmlu_sociology": "c4b3942888782b4892aa35338f4b019277be2647118282a01ca6d4247341d655",
+ "openaimmlu_international_law": "f6989bdda04b24d24bd12a8a9c89552374071ab9b67476934ced71fcc9295030",
+ "openaimmlu_jurisprudence": "46a22b82dd04e2c6c0948be36b4104906b7ebbc5e8f68e91d32c49241548500d",
+ "openaimmlu_high_school_world_history": "252ee950c7a63347297c321eb62582d5efb9a816d6561da934f14f87f8203ace",
+ "openaimmlu_prehistory": "8d6d577689ab2cebbf7b19bc964bfaef4bef05e362395be7bb817ac23144687d",
+ "openaimmlu_world_religions": "5d9ac0fc4f9744e434d301ee543c5876c9d6f420d30a7643230ce157c6ca394f",
+ "openaimmlu_philosophy": "d217b4f827b712ce5f4a08e8fc26629a86c84f93b58bef9002ef376fceedc8ab",
+ "openaimmlu_logical_fallacies": "94ef523dd37da932e84321a1654eeb7b6797c37b4a05c0fd08885893d192b9e8",
+ "openaimmlu_high_school_european_history": "fb273e15b3cbc2c7b4af95fd69ea68fa995204b964acac0c8757920434f1bd36",
+ "openaimmlu_high_school_us_history": "637ba8e2d7ffdea5de66a1c7e2a314f3ec0e7808893d6269100b084bd5167e6e"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 600998.580700401,
+ "end_time": 601190.357399357,
+ "total_evaluation_time_seconds": "191.77669895603321"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Llama-3.3-70B-Instruct/acva_5_shot.json b/evaluations/ar/Llama-3.3-70B-Instruct/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..5673f8b5b2d1c4b108f5430bbb7f86cbdbd0602b
--- /dev/null
+++ b/evaluations/ar/Llama-3.3-70B-Instruct/acva_5_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7847301951779564,
+ "acc_stderr,none": 0.004404205705558861,
+ "acc_norm,none": 0.769345579793341,
+ "acc_norm_stderr,none": 0.004513957617295361
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 1.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 70553706496,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737861513.0031924,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 822799.725415956,
+ "end_time": 824041.525682158,
+ "total_evaluation_time_seconds": "1241.8002662019571"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Llama-3.3-70B-Instruct/ar_ifeval_0_shot.json b/evaluations/ar/Llama-3.3-70B-Instruct/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..15a74c3a1be1fc7ac5066bf1187937d197455eac
--- /dev/null
+++ b/evaluations/ar/Llama-3.3-70B-Instruct/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.7089552238805971,
+ "prompt_level_strict_acc_stderr,none": 0.019638685568678992,
+ "inst_level_strict_acc,none": 0.8860068259385665,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.7947761194029851,
+ "prompt_level_loose_acc_stderr,none": 0.017460611985170207,
+ "inst_level_loose_acc,none": 0.9208191126279863,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 70553706496,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738755018.193393,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "ar_ifeval": "6bd5bfb26ee4f5909e16d66ee0e564fb2a5826815f16755272465c9e03f98a20"
+ },
+ "model_source": "hf",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 744977.123888747,
+ "end_time": 758450.608805326,
+ "total_evaluation_time_seconds": "13473.484916579095"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Llama-3.3-70B-Instruct/araMath_v3_5_shot.json b/evaluations/ar/Llama-3.3-70B-Instruct/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..e423aa59cdac9229eb34e8421d3c6599fb94713a
--- /dev/null
+++ b/evaluations/ar/Llama-3.3-70B-Instruct/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.7090909090909091,
+ "acc_stderr,none": 0.01848039016780232,
+ "acc_norm,none": 0.7090909090909091,
+ "acc_norm_stderr,none": 0.01848039016780232
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 70553706496,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738750317.5038416,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "araMath_v3": "154ea94d6776e7d3980c98343cec49115ef3dc4dab8897fb4668f68494d55c76"
+ },
+ "model_source": "hf",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 740276.643313964,
+ "end_time": 740434.169818474,
+ "total_evaluation_time_seconds": "157.5265045099659"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Llama-3.3-70B-Instruct/araPro_0_shot.json b/evaluations/ar/Llama-3.3-70B-Instruct/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..083c9d01823f32261d35be59f5a8fe047cc01ead
--- /dev/null
+++ b/evaluations/ar/Llama-3.3-70B-Instruct/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.7048590281943611,
+ "acc_stderr,none": 0.006450314388729491,
+ "acc_norm,none": 0.7048590281943611,
+ "acc_norm_stderr,none": 0.006450314388729491
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 70553706496,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738742514.712935,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "araPro": "ab4849e5668de72a27844a2a354787cbce92af5027f46a32300417b41913c5db"
+ },
+ "model_source": "hf",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 732473.787962617,
+ "end_time": 736407.61692168,
+ "total_evaluation_time_seconds": "3933.8289590630447"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Llama-3.3-70B-Instruct/arabicmmlu_0_shot.json b/evaluations/ar/Llama-3.3-70B-Instruct/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f1c74b1f7bb8810fa1e1ef51060a45975db7f2b6
--- /dev/null
+++ b/evaluations/ar/Llama-3.3-70B-Instruct/arabicmmlu_0_shot.json
@@ -0,0 +1,2051 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.7200968523002421,
+ "acc_stderr,none": 0.003653809830387355,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.7367695700110254,
+ "acc_stderr,none": 0.007118478408616655,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.5644736842105263,
+ "acc_stderr,none": 0.01799733343022178
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.7574850299401198,
+ "acc_stderr,none": 0.023487359027875285
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.7435897435897436,
+ "acc_stderr,none": 0.07083413480167725
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.7089201877934272,
+ "acc_stderr,none": 0.017984334664115503
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.7586206896551724,
+ "acc_stderr,none": 0.03010833071801162
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.7899159663865546,
+ "acc_stderr,none": 0.026461398717471874
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.7058823529411765,
+ "acc_stderr,none": 0.04533838195929775
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.8548548548548549,
+ "acc_stderr,none": 0.011150187682575276
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.767515923566879,
+ "acc_stderr,none": 0.023876360884096247
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.704131227217497,
+ "acc_stderr,none": 0.01074858647087823,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.8169934640522876,
+ "acc_stderr,none": 0.015643069911273347
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.6986301369863014,
+ "acc_stderr,none": 0.024050431713518203
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.4717948717948718,
+ "acc_stderr,none": 0.025310639254933903
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.7777777777777778,
+ "acc_stderr,none": 0.08153326507837146
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.7896825396825397,
+ "acc_stderr,none": 0.025723323024496765
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.7564412238325282,
+ "acc_stderr,none": 0.008605534818784389,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.7704376548307185,
+ "acc_stderr,none": 0.012090002524101525
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.7245370370370371,
+ "acc_stderr,none": 0.015207453766372243
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.7848837209302325,
+ "acc_stderr,none": 0.0314225368473594
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.7592592592592593,
+ "acc_stderr,none": 0.033694336336687475
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.8266666666666667,
+ "acc_stderr,none": 0.04400382183783964
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.697203196347032,
+ "acc_stderr,none": 0.007663541005039597,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.5977011494252874,
+ "acc_stderr,none": 0.052877049732218045
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.7166666666666667,
+ "acc_stderr,none": 0.023782648315084427
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.6290944123314065,
+ "acc_stderr,none": 0.015000309630517242
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.6228813559322034,
+ "acc_stderr,none": 0.03161605923498462
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.7931034482758621,
+ "acc_stderr,none": 0.04368097459950702
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.7389705882352942,
+ "acc_stderr,none": 0.026679252270103114
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.6390041493775933,
+ "acc_stderr,none": 0.031002543340279055
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.7368421052631579,
+ "acc_stderr,none": 0.058843894144731304
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.825531914893617,
+ "acc_stderr,none": 0.014303377520795746
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.6621621621621622,
+ "acc_stderr,none": 0.05535729934952123
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.6715328467153284,
+ "acc_stderr,none": 0.04027264457070886
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.6857142857142857,
+ "acc_stderr,none": 0.0321115135399438
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.7062323833385531,
+ "acc_stderr,none": 0.007870570600880707,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.6153300212916962,
+ "acc_stderr,none": 0.012965726952941084
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.7471264367816092,
+ "acc_stderr,none": 0.026956412412778324
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.6509803921568628,
+ "acc_stderr,none": 0.029908319306125593
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.9629629629629629,
+ "acc_stderr,none": 0.03703703703703703
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.8429752066115702,
+ "acc_stderr,none": 0.023435973310697193
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.7789473684210526,
+ "acc_stderr,none": 0.030183597428219758
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.7334963325183375,
+ "acc_stderr,none": 0.02188872609697175
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.8958333333333334,
+ "acc_stderr,none": 0.016689971269054218
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.75,
+ "acc_stderr,none": 0.05455447255899809
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.7200968523002421,
+ "acc_stderr,none": 0.003653809830387355,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.7367695700110254,
+ "acc_stderr,none": 0.007118478408616655,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.704131227217497,
+ "acc_stderr,none": 0.01074858647087823,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.7564412238325282,
+ "acc_stderr,none": 0.008605534818784389,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.697203196347032,
+ "acc_stderr,none": 0.007663541005039597,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.7062323833385531,
+ "acc_stderr,none": 0.007870570600880707,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_arabic_language_(general)",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_primary_computer_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_high_history",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_islamic_studies"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_high_civics",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_primary_social_science"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_univ_management",
+ "arabicmmlu_driving_test"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 70553706496,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737858946.4669714,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 820233.226282937,
+ "end_time": 821135.688521802,
+ "total_evaluation_time_seconds": "902.4622388649732"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Llama-3.3-70B-Instruct/etec_v2_0_shot.json b/evaluations/ar/Llama-3.3-70B-Instruct/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f341d7c1e2309ae3310c9db9d682c6660dd93a5b
--- /dev/null
+++ b/evaluations/ar/Llama-3.3-70B-Instruct/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.6883942766295708,
+ "acc_stderr,none": 0.010664745454850943,
+ "acc_norm,none": 0.6883942766295708,
+ "acc_norm_stderr,none": 0.010664745454850943
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 70553706496,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738746708.9926562,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "etec_v2": "f9810ea40ab4721486631d02578e3b62811871d66f80ee350dc574ca63d72e12"
+ },
+ "model_source": "hf",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 736668.210182346,
+ "end_time": 736927.122919428,
+ "total_evaluation_time_seconds": "258.9127370819915"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Llama-3.3-70B-Instruct/exams_ar_5_shot.json b/evaluations/ar/Llama-3.3-70B-Instruct/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..85c8dfed53f54bd8f468780461d96405ca749dc3
--- /dev/null
+++ b/evaluations/ar/Llama-3.3-70B-Instruct/exams_ar_5_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.6573556797020484,
+ "acc_stderr,none": 0.02049932607490297,
+ "acc_norm,none": 0.6573556797020484,
+ "acc_norm_stderr,none": 0.02049932607490297
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 70553706496,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737862801.5409079,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 824088.349963979,
+ "end_time": 824352.47927673,
+ "total_evaluation_time_seconds": "264.1293127509998"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Llama-3.3-70B-Instruct/gat_0_shot.json b/evaluations/ar/Llama-3.3-70B-Instruct/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..42a899655274d7319a286d4f56cb081f7383b2d0
--- /dev/null
+++ b/evaluations/ar/Llama-3.3-70B-Instruct/gat_0_shot.json
@@ -0,0 +1,545 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.4412391822400602,
+ "acc_stderr,none": 0.0038602448360070085,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.4148423005565863,
+ "acc_stderr,none": 0.00949246890612482
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.3063752276867031,
+ "acc_stderr,none": 0.008800291696618008
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.43908722856091276,
+ "acc_stderr,none": 0.009522657932144745
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.3751196172248804,
+ "acc_stderr,none": 0.014984183551431945
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.4131147540983607,
+ "acc_stderr,none": 0.014102954212147805
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.5702479338842975,
+ "acc_stderr,none": 0.014237301970481165
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.4148773006134969,
+ "acc_stderr,none": 0.013649322722470929
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.473972602739726,
+ "acc_stderr,none": 0.026171590093068544
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.5988657844990548,
+ "acc_stderr,none": 0.00953188686023188
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.4412391822400602,
+ "acc_stderr,none": 0.0038602448360070085,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 70553706496,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737819997.849324,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 781284.750234253,
+ "end_time": 782185.575911678,
+ "total_evaluation_time_seconds": "900.8256774250185"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Llama-3.3-70B-Instruct/moe_ien_mcq_0_shot.json b/evaluations/ar/Llama-3.3-70B-Instruct/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f20380d3f7304ba01745f972eb0e135f865c3596
--- /dev/null
+++ b/evaluations/ar/Llama-3.3-70B-Instruct/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.795995995995996,
+ "acc_stderr,none": 0.004031937401121064,
+ "acc_norm,none": 0.795995995995996,
+ "acc_norm_stderr,none": 0.004031937401121064
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 70553706496,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738747043.1224887,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "moe_ien_mcq": "2f293909f445c6fdbe42ca2044dd07ac3eb752a7c1ea459602a8757356016dd9"
+ },
+ "model_source": "hf",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 737002.279803232,
+ "end_time": 737981.71443428,
+ "total_evaluation_time_seconds": "979.4346310478868"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Llama-3.3-70B-Instruct/moe_ien_tf_0_shot.json b/evaluations/ar/Llama-3.3-70B-Instruct/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f85672a94b91a93b23404563526e26cc48647817
--- /dev/null
+++ b/evaluations/ar/Llama-3.3-70B-Instruct/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.7880817448050833,
+ "acc_stderr,none": 0.005355915518300743,
+ "acc_norm,none": 0.7880817448050833,
+ "acc_norm_stderr,none": 0.005355915518300743
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 70553706496,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738748085.1630871,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "moe_ien_tf": "ad47da488f5a4855855290b03172f21cc8709d26c8228bac708e4791056290c9"
+ },
+ "model_source": "hf",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 738044.375417544,
+ "end_time": 739098.635400457,
+ "total_evaluation_time_seconds": "1054.2599829129176"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Llama-3.3-70B-Instruct/openaimmlu_0_shot.json b/evaluations/ar/Llama-3.3-70B-Instruct/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..d95785ef28cf93711a2caa5921145ebb21e3ee27
--- /dev/null
+++ b/evaluations/ar/Llama-3.3-70B-Instruct/openaimmlu_0_shot.json
@@ -0,0 +1,2662 @@
+{
+ "results": {
+ "openaimmlu": {
+ "acc,none": 0.7025352513886911,
+ "acc_stderr,none": 0.0037280323038272477,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.6384105960264901,
+ "acc_stderr,none": 0.00845271816368979,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.41,
+ "acc_stderr,none": 0.049431107042371025
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.7960526315789473,
+ "acc_stderr,none": 0.0327900040631005
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.7916666666666666,
+ "acc_stderr,none": 0.03396116205845334
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.52,
+ "acc_stderr,none": 0.050211673156867795
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.57,
+ "acc_stderr,none": 0.04975698519562428
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.41,
+ "acc_stderr,none": 0.049431107042371025
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.5686274509803921,
+ "acc_stderr,none": 0.04928099597287533
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.77,
+ "acc_stderr,none": 0.042295258468165044
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.7531914893617021,
+ "acc_stderr,none": 0.028185441301234106
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.6052631578947368,
+ "acc_stderr,none": 0.045981880578165414
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.6413793103448275,
+ "acc_stderr,none": 0.039966295748767186
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.5740740740740741,
+ "acc_stderr,none": 0.02546714904546955
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.8096774193548387,
+ "acc_stderr,none": 0.022331707611823085
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.645320197044335,
+ "acc_stderr,none": 0.033661244890514495
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.8,
+ "acc_stderr,none": 0.04020151261036844
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.44814814814814813,
+ "acc_stderr,none": 0.030321167196316282
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.5496688741721855,
+ "acc_stderr,none": 0.04062290018683775
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.6712962962962963,
+ "acc_stderr,none": 0.032036140846700596
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.8015521064301552,
+ "acc_stderr,none": 0.009312893863787008,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.8242424242424242,
+ "acc_stderr,none": 0.02972094300622445
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.8970588235294118,
+ "acc_stderr,none": 0.02132833757080437
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.8818565400843882,
+ "acc_stderr,none": 0.021011052659878453
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.8512396694214877,
+ "acc_stderr,none": 0.03248470083807196
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.75,
+ "acc_stderr,none": 0.04186091791394607
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.7300613496932515,
+ "acc_stderr,none": 0.03487825168497892
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.7588424437299035,
+ "acc_stderr,none": 0.024296594034763426
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.7623456790123457,
+ "acc_stderr,none": 0.023683591837008557
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.7719298245614035,
+ "acc_stderr,none": 0.03218093795602357
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.6803776129467296,
+ "acc_stderr,none": 0.0058476578206321,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.6222222222222222,
+ "acc_stderr,none": 0.04188307537595853
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.7132075471698113,
+ "acc_stderr,none": 0.02783491252754407
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.6416184971098265,
+ "acc_stderr,none": 0.03656343653353158
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.5317460317460317,
+ "acc_stderr,none": 0.04463112720677172
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.5,
+ "acc_stderr,none": 0.050251890762960605
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.8232323232323232,
+ "acc_stderr,none": 0.027178752639044915
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.8293577981651377,
+ "acc_stderr,none": 0.016129271025099853
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.7668161434977578,
+ "acc_stderr,none": 0.028380391147094713
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.5625,
+ "acc_stderr,none": 0.04708567521880525
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.85,
+ "acc_stderr,none": 0.0358870281282637
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.822477650063857,
+ "acc_stderr,none": 0.013664230995834838
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.761437908496732,
+ "acc_stderr,none": 0.024404394928087866
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.5319148936170213,
+ "acc_stderr,none": 0.029766675075873866
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.5475880052151239,
+ "acc_stderr,none": 0.012712265105889136
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.7794117647058824,
+ "acc_stderr,none": 0.02518778666022727
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.7205882352941176,
+ "acc_stderr,none": 0.018152871051538816
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.4879518072289157,
+ "acc_stderr,none": 0.0389136449583582
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.7471089470480827,
+ "acc_stderr,none": 0.0074744908927775675,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.72,
+ "acc_stderr,none": 0.04512608598542128
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.9067357512953368,
+ "acc_stderr,none": 0.02098685459328973
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.7487179487179487,
+ "acc_stderr,none": 0.021992016662370575
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.7983193277310925,
+ "acc_stderr,none": 0.02606431340630453
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.7786259541984732,
+ "acc_stderr,none": 0.03641297081313729
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.8058252427184466,
+ "acc_stderr,none": 0.03916667762822582
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.8675213675213675,
+ "acc_stderr,none": 0.022209309073165612
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.6907514450867052,
+ "acc_stderr,none": 0.024883140570071755
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.6681564245810055,
+ "acc_stderr,none": 0.015748421208187306
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.6272727272727273,
+ "acc_stderr,none": 0.04631381319425465
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.7428571428571429,
+ "acc_stderr,none": 0.027979823538744546
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.8159203980099502,
+ "acc_stderr,none": 0.027403859410786848
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.86,
+ "acc_stderr,none": 0.03487350880197771
+ }
+ },
+ "groups": {
+ "openaimmlu": {
+ "acc,none": 0.7025352513886911,
+ "acc_stderr,none": 0.0037280323038272477,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.6384105960264901,
+ "acc_stderr,none": 0.00845271816368979,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.8015521064301552,
+ "acc_stderr,none": 0.009312893863787008,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.6803776129467296,
+ "acc_stderr,none": 0.0058476578206321,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.7471089470480827,
+ "acc_stderr,none": 0.0074744908927775675,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_prehistory",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_international_law",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_philosophy",
+ "openaimmlu_world_religions"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_management",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_public_relations",
+ "openaimmlu_sociology",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_security_studies",
+ "openaimmlu_marketing",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_business_ethics",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_high_school_microeconomics"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_professional_psychology",
+ "openaimmlu_anatomy",
+ "openaimmlu_human_aging",
+ "openaimmlu_global_facts",
+ "openaimmlu_formal_logic",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_virology",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_college_medicine",
+ "openaimmlu_machine_learning",
+ "openaimmlu_professional_law",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_nutrition",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_high_school_psychology"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_econometrics",
+ "openaimmlu_college_physics",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_computer_security",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_astronomy",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_college_biology",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_high_school_computer_science"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu": 0,
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 70553706496,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737860280.209131,
+    "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 821567.081517706,
+ "end_time": 822756.147458029,
+ "total_evaluation_time_seconds": "1189.0659403229365"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Meta-Llama-3.1-8B-Instruct/acva_5_shot.json b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6b68eb9e9532243038eb6edce7e20e66fe83da0f
--- /dev/null
+++ b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/acva_5_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7252583237657865,
+ "acc_stderr,none": 0.004783265499715521,
+ "acc_norm,none": 0.6993111366245695,
+ "acc_norm_stderr,none": 0.004913712570670582
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 1.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "0e9e39f249a16976918f6564b8830bc894c89659",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737779312.1802437,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 26393.300114519,
+ "end_time": 26592.595877222,
+ "total_evaluation_time_seconds": "199.29576270300095"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Meta-Llama-3.1-8B-Instruct/ar_ifeval_0_shot.json b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..729b9e5ef3af68e85aa7dccfd907eeaea7e3f171
--- /dev/null
+++ b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/ar_ifeval_0_shot.json
@@ -0,0 +1,138 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.5335820895522388,
+ "prompt_level_strict_acc_stderr,none": 0.021568072772161277,
+ "inst_level_strict_acc,none": 0.7931740614334472,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.6156716417910447,
+ "prompt_level_loose_acc_stderr,none": 0.021030466164007045,
+ "inst_level_loose_acc,none": 0.8327645051194539,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=4,data_parallel_size=2,download_dir=/tmp,enforce_eager=False",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738654504.3474658,
+    "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "ar_ifeval": "7e137a94e1650273c7c8431db3a799d999471d4003bbb61e67fc4369b573a251"
+ },
+ "model_source": "vllm",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+    "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n    {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n    {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n    {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n    {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n    {%- set system_message = messages[0]['content']|trim %}\n    {%- set messages = messages[1:] %}\n{%- else %}\n    {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n    {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n    {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n    {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n    {#- Extract the first user message so we can plug it in here #}\n    {%- if messages | length != 0 %}\n        {%- set first_user_message = messages[0]['content']|trim %}\n        {%- set messages = messages[1:] %}\n    {%- else %}\n        {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n    {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n    {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n    {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n    {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n    {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n        {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n    {%- elif 'tool_calls' in message %}\n        {%- if not message.tool_calls|length == 1 %}\n            {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n        {%- endif %}\n        {%- set tool_call = message.tool_calls[0].function %}\n        {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n            {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n            {%- for arg_name, arg_val in tool_call.arguments | items %}\n                {{- arg_name + '=\"' + arg_val + '\"' }}\n                {%- if not loop.last %}\n                    {{- \", \" }}\n                {%- endif %}\n            {%- endfor %}\n            {{- \")\" }}\n        {%- else %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n            {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n            {{- '\"parameters\": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- \"}\" }}\n        {%- endif %}\n        {%- if builtin_tools is defined %}\n            {#- This means we're in ipython mode #}\n            {{- \"<|eom_id|>\" }}\n        {%- else %}\n            {{- \"<|eot_id|>\" }}\n        {%- endif %}\n    {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n        {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n        {%- if message.content is mapping or message.content is iterable %}\n            {{- message.content | tojson }}\n        {%- else %}\n            {{- message.content }}\n        {%- endif %}\n        {{- \"<|eot_id|>\" }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 437265.909736722,
+ "end_time": 437523.975434726,
+ "total_evaluation_time_seconds": "258.06569800397847"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Meta-Llama-3.1-8B-Instruct/araMath_v3_5_shot.json b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..cdeb776c4b68df6df2747dd912ca79567699807b
--- /dev/null
+++ b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/araMath_v3_5_shot.json
@@ -0,0 +1,122 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.35702479338842974,
+ "acc_stderr,none": 0.019495206164626543,
+ "acc_norm,none": 0.35702479338842974,
+ "acc_norm_stderr,none": 0.019495206164626543
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\nالسؤال: {question}\\n{choices}\\nالاجابة:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "من فضلك اختر إجابة واحدة من بين 'A، B، C، D' دون شرح",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,data_parallel_size=8,download_dir=/tmp,enforce_eager=False",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738675025.3226728,
+    "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "araMath_v3": "544990bff2e8bb7c1408ff006ba780ea68d8d7f78c633fb7035e71e43345d5a4"
+ },
+ "model_source": "vllm",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+    "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n    {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n    {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n    {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n    {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n    {%- set system_message = messages[0]['content']|trim %}\n    {%- set messages = messages[1:] %}\n{%- else %}\n    {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n    {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n    {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n    {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n    {#- Extract the first user message so we can plug it in here #}\n    {%- if messages | length != 0 %}\n        {%- set first_user_message = messages[0]['content']|trim %}\n        {%- set messages = messages[1:] %}\n    {%- else %}\n        {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n    {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n    {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n    {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n    {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n    {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n        {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n    {%- elif 'tool_calls' in message %}\n        {%- if not message.tool_calls|length == 1 %}\n            {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n        {%- endif %}\n        {%- set tool_call = message.tool_calls[0].function %}\n        {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n            {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n            {%- for arg_name, arg_val in tool_call.arguments | items %}\n                {{- arg_name + '=\"' + arg_val + '\"' }}\n                {%- if not loop.last %}\n                    {{- \", \" }}\n                {%- endif %}\n            {%- endfor %}\n            {{- \")\" }}\n        {%- else %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n            {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n            {{- '\"parameters\": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- \"}\" }}\n        {%- endif %}\n        {%- if builtin_tools is defined %}\n            {#- This means we're in ipython mode #}\n            {{- \"<|eom_id|>\" }}\n        {%- else %}\n            {{- \"<|eot_id|>\" }}\n        {%- endif %}\n    {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n        {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n        {%- if message.content is mapping or message.content is iterable %}\n            {{- message.content | tojson }}\n        {%- else %}\n            {{- message.content }}\n        {%- endif %}\n        {{- \"<|eot_id|>\" }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 1038640.023630069,
+ "end_time": 1038724.32179284,
+ "total_evaluation_time_seconds": "84.29816277103964"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Meta-Llama-3.1-8B-Instruct/araPro_0_shot.json b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..2855a95b7b69430e955f5ad28cf41f04a36e4594
--- /dev/null
+++ b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.5250949810037993,
+ "acc_stderr,none": 0.007062156072028268,
+ "acc_norm,none": 0.5250949810037993,
+ "acc_norm_stderr,none": 0.007062156072028268
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "0e9e39f249a16976918f6564b8830bc894c89659",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617047.873544,
+    "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "araPro": "ab4849e5668de72a27844a2a354787cbce92af5027f46a32300417b41913c5db"
+ },
+ "model_source": "hf",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 1609822.907637183,
+ "end_time": 1610372.150443636,
+ "total_evaluation_time_seconds": "549.242806453025"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Meta-Llama-3.1-8B-Instruct/arabicmmlu_0_shot.json b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ef1464bb7cb65f1acbcb2496c7a646efa40fa52d
--- /dev/null
+++ b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/arabicmmlu_0_shot.json
@@ -0,0 +1,2051 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.564303009339329,
+ "acc_stderr,none": 0.0040196752630034735,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5587100330760749,
+ "acc_stderr,none": 0.007915141829477251,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.4276315789473684,
+ "acc_stderr,none": 0.01795774617649965
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.6766467065868264,
+ "acc_stderr,none": 0.02563288645517917
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.5641025641025641,
+ "acc_stderr,none": 0.08044135838502685
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.38341158059467917,
+ "acc_stderr,none": 0.01924952226173331
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.5960591133004927,
+ "acc_stderr,none": 0.03452453903822032
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.6764705882352942,
+ "acc_stderr,none": 0.030388353551886797
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.5392156862745098,
+ "acc_stderr,none": 0.049598599663841815
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.7267267267267268,
+ "acc_stderr,none": 0.014106487065973238
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.46496815286624205,
+ "acc_stderr,none": 0.02819221844954206
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.56318347509113,
+ "acc_stderr,none": 0.011882048451256877,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.6683006535947712,
+ "acc_stderr,none": 0.019047485239360375
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.5698630136986301,
+ "acc_stderr,none": 0.02595003437064698
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.3641025641025641,
+ "acc_stderr,none": 0.02439667298509477
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.7777777777777778,
+ "acc_stderr,none": 0.08153326507837146
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.5833333333333334,
+ "acc_stderr,none": 0.031118303728104594
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6272141706924316,
+ "acc_stderr,none": 0.009640611430777322,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.6672171758876961,
+ "acc_stderr,none": 0.013546321390449041
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.5474537037037037,
+ "acc_stderr,none": 0.016943370542362845
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.6686046511627907,
+ "acc_stderr,none": 0.035996464381795934
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.6851851851851852,
+ "acc_stderr,none": 0.036603163762720714
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.68,
+ "acc_stderr,none": 0.05422675115236518
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.5547945205479452,
+ "acc_stderr,none": 0.008278003487917672,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.4367816091954023,
+ "acc_stderr,none": 0.05348368965287097
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.575,
+ "acc_stderr,none": 0.026090425569673736
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.47398843930635837,
+ "acc_stderr,none": 0.015505727274549675
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.4872881355932203,
+ "acc_stderr,none": 0.03260586088180842
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.05083285677753486
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.5845588235294118,
+ "acc_stderr,none": 0.029935342707877746
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5228215767634855,
+ "acc_stderr,none": 0.03224122462224077
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.5789473684210527,
+ "acc_stderr,none": 0.06597717584505354
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.7021276595744681,
+ "acc_stderr,none": 0.017236012495765663
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.5675675675675675,
+ "acc_stderr,none": 0.057983774751431016
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.5547445255474452,
+ "acc_stderr,none": 0.04261688398864188
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.49047619047619045,
+ "acc_stderr,none": 0.034579448570031264
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5327278421547135,
+ "acc_stderr,none": 0.00860088193534487,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.43293115684882894,
+ "acc_stderr,none": 0.013204622401057848
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.5708812260536399,
+ "acc_stderr,none": 0.03069551782571805
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.43529411764705883,
+ "acc_stderr,none": 0.031108974626602753
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.7407407407407407,
+ "acc_stderr,none": 0.08594360757264022
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.6818181818181818,
+ "acc_stderr,none": 0.03000291471043612
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.6894736842105263,
+ "acc_stderr,none": 0.03365713545671698
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.5599022004889975,
+ "acc_stderr,none": 0.024575400500226115
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.7380952380952381,
+ "acc_stderr,none": 0.02402179716619147
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.59375,
+ "acc_stderr,none": 0.061876853828249374
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.564303009339329,
+ "acc_stderr,none": 0.0040196752630034735,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5587100330760749,
+ "acc_stderr,none": 0.007915141829477251,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.56318347509113,
+ "acc_stderr,none": 0.011882048451256877,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6272141706924316,
+ "acc_stderr,none": 0.009640611430777322,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.5547945205479452,
+ "acc_stderr,none": 0.008278003487917672,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5327278421547135,
+ "acc_stderr,none": 0.00860088193534487,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_arabic_language_(general)",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_primary_math",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_univ_computer_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_prof_law"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_high_economics",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_middle_civics"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_univ_management",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_general_knowledge"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "0e9e39f249a16976918f6564b8830bc894c89659",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737778654.0503197,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 25735.027525946,
+ "end_time": 25948.04309341,
+ "total_evaluation_time_seconds": "213.01556746400092"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Meta-Llama-3.1-8B-Instruct/etec_v2_0_shot.json b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..be0a1ee4d888ecdee4cf7a4e09af4a559f16f775
--- /dev/null
+++ b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.456809750927398,
+ "acc_stderr,none": 0.01147024835105639,
+ "acc_norm,none": 0.456809750927398,
+ "acc_norm_stderr,none": 0.01147024835105639
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "0e9e39f249a16976918f6564b8830bc894c89659",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617646.5966089,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "etec_v2": "f9810ea40ab4721486631d02578e3b62811871d66f80ee350dc574ca63d72e12"
+ },
+ "model_source": "hf",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 1610421.453807966,
+ "end_time": 1610498.158299866,
+ "total_evaluation_time_seconds": "76.70449189981446"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Meta-Llama-3.1-8B-Instruct/exams_ar_5_shot.json b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..80124d035fb2a4d9e6cfa029a09bd4a679101bf6
--- /dev/null
+++ b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/exams_ar_5_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.5251396648044693,
+ "acc_stderr,none": 0.02156939500417479,
+ "acc_norm,none": 0.5251396648044693,
+ "acc_norm_stderr,none": 0.02156939500417479
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "0e9e39f249a16976918f6564b8830bc894c89659",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737779550.003421,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 26630.902640257,
+ "end_time": 26676.356655983,
+ "total_evaluation_time_seconds": "45.45401572599803"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Meta-Llama-3.1-8B-Instruct/gat_0_shot.json b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..825de1e20405202f37b7c4774e34f1f40c7f155f
--- /dev/null
+++ b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/gat_0_shot.json
@@ -0,0 +1,545 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.3090430201931519,
+ "acc_stderr,none": 0.003623669512802982,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.25120593692022264,
+ "acc_stderr,none": 0.008355979196698268
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.2754098360655738,
+ "acc_stderr,none": 0.008527935108212162
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.2929701877070298,
+ "acc_stderr,none": 0.00873304494093164
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.26889952153110047,
+ "acc_stderr,none": 0.013722501896040254
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.26721311475409837,
+ "acc_stderr,none": 0.01267406341937153
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.37355371900826445,
+ "acc_stderr,none": 0.013912503912467983
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.28450920245398775,
+ "acc_stderr,none": 0.012499077975909817
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.2876712328767123,
+ "acc_stderr,none": 0.023726723391354478
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.4400756143667297,
+ "acc_stderr,none": 0.009653784894336059
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.3090430201931519,
+ "acc_stderr,none": 0.003623669512802982,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "0e9e39f249a16976918f6564b8830bc894c89659",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737768859.2760568,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 15940.236927019,
+ "end_time": 16130.776899079,
+ "total_evaluation_time_seconds": "190.53997205999985"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Meta-Llama-3.1-8B-Instruct/moe_ien_mcq_0_shot.json b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..cae160f0d0601edcf99f3dc02b06e2e77f006803
--- /dev/null
+++ b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.5922922922922923,
+ "acc_stderr,none": 0.004916788134998954,
+ "acc_norm,none": 0.5922922922922923,
+ "acc_norm_stderr,none": 0.004916788134998954
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "0e9e39f249a16976918f6564b8830bc894c89659",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617794.6685781,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "moe_ien_mcq": "2f293909f445c6fdbe42ca2044dd07ac3eb752a7c1ea459602a8757356016dd9"
+ },
+ "model_source": "hf",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 1610569.610297447,
+ "end_time": 1610870.6725387,
+ "total_evaluation_time_seconds": "301.0622412529774"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Meta-Llama-3.1-8B-Instruct/moe_ien_tf_0_shot.json b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b8501aa3020812e11caeef24264b5708d77bcef8
--- /dev/null
+++ b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/moe_ien_tf_0_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.7173278378842521,
+ "acc_stderr,none": 0.005901525152083598,
+ "acc_norm,none": 0.7173278378842521,
+ "acc_norm_stderr,none": 0.005901525152083598
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"صحيحة\",\n \"خاطئة\"\n ]\n #keys =[\"صواب\",\n # \"خطأ\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\nالسؤال:\" +doc[\"Question\"]+\"\\nإجابة:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "فيما يلي عبارات إما صحيحة أو خاطئة حول {{Subject}}\n الرجاء تصنيف العبارة إلى 'صحيحة' أو 'خاطئة' دون شرح ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,data_parallel_size=8,download_dir=/tmp,enforce_eager=False",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738682459.4089465,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "moe_ien_tf": "8233395e832e9bd87361282558343c4a080c3ea607d00e045339d417c84f4e85"
+ },
+ "model_source": "vllm",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
+ "start_time": 1222389.067343241,
+ "end_time": 1222491.234081002,
+ "total_evaluation_time_seconds": "102.16673776088282"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Meta-Llama-3.1-8B-Instruct/openaimmlu_0_shot.json b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..20e3e31800fd823502a09232fc1bac6101f4bbf5
--- /dev/null
+++ b/evaluations/ar/Meta-Llama-3.1-8B-Instruct/openaimmlu_0_shot.json
@@ -0,0 +1,2662 @@
+{
+ "results": {
+ "openaimmlu": {
+ "acc,none": 0.44666001994017945,
+ "acc_stderr,none": 0.004112616445357971,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.40794701986754967,
+ "acc_stderr,none": 0.008874683686325746,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.3,
+ "acc_stderr,none": 0.046056618647183814
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.5328947368421053,
+ "acc_stderr,none": 0.040601270352363966
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.4583333333333333,
+ "acc_stderr,none": 0.04166666666666665
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.43,
+ "acc_stderr,none": 0.04975698519562427
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.35,
+ "acc_stderr,none": 0.047937248544110196
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.35,
+ "acc_stderr,none": 0.0479372485441102
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.35294117647058826,
+ "acc_stderr,none": 0.04755129616062946
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.44,
+ "acc_stderr,none": 0.04988876515698589
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.37446808510638296,
+ "acc_stderr,none": 0.031639106653672915
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.2807017543859649,
+ "acc_stderr,none": 0.042270544512322
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.4413793103448276,
+ "acc_stderr,none": 0.04137931034482758
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.3783068783068783,
+ "acc_stderr,none": 0.024976954053155243
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.5419354838709678,
+ "acc_stderr,none": 0.028343787250540625
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.41379310344827586,
+ "acc_stderr,none": 0.03465304488406796
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.5,
+ "acc_stderr,none": 0.050251890762960605
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.35555555555555557,
+ "acc_stderr,none": 0.0291857149498574
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.3509933774834437,
+ "acc_stderr,none": 0.038969819642573754
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.3888888888888889,
+ "acc_stderr,none": 0.03324708911809117
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.5144124168514412,
+ "acc_stderr,none": 0.011703005860087082,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.5696969696969697,
+ "acc_stderr,none": 0.03866225962879077
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.5245098039215687,
+ "acc_stderr,none": 0.035050931943487976
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.5991561181434599,
+ "acc_stderr,none": 0.031900803894732356
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.6115702479338843,
+ "acc_stderr,none": 0.044492703500683836
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.5555555555555556,
+ "acc_stderr,none": 0.04803752235190192
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.4723926380368098,
+ "acc_stderr,none": 0.0392237829061099
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.47266881028938906,
+ "acc_stderr,none": 0.02835563356832818
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.4228395061728395,
+ "acc_stderr,none": 0.027487472980871598
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.5263157894736842,
+ "acc_stderr,none": 0.038295098689947286
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.4364463924477411,
+ "acc_stderr,none": 0.00633626561036892,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.37037037037037035,
+ "acc_stderr,none": 0.04171654161354544
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.5056603773584906,
+ "acc_stderr,none": 0.03077090076385131
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.4508670520231214,
+ "acc_stderr,none": 0.03794012674697029
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.3333333333333333,
+ "acc_stderr,none": 0.04216370213557835
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.34,
+ "acc_stderr,none": 0.04760952285695235
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.5858585858585859,
+ "acc_stderr,none": 0.035094383488796295
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.5431192660550459,
+ "acc_stderr,none": 0.021357458785226203
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.47533632286995514,
+ "acc_stderr,none": 0.03351695167652628
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.25,
+ "acc_stderr,none": 0.04109974682633932
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.56,
+ "acc_stderr,none": 0.04988876515698589
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.5440613026819924,
+ "acc_stderr,none": 0.01781040392543535
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.5294117647058824,
+ "acc_stderr,none": 0.028580341065138286
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.3475177304964539,
+ "acc_stderr,none": 0.028406627809590947
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.3396349413298566,
+ "acc_stderr,none": 0.01209559250693197
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.47794117647058826,
+ "acc_stderr,none": 0.030343264224213528
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.4035947712418301,
+ "acc_stderr,none": 0.019848280168401164
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.39156626506024095,
+ "acc_stderr,none": 0.03799857454479637
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.46348143639683503,
+ "acc_stderr,none": 0.008379584468677955,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.54,
+ "acc_stderr,none": 0.05009082659620332
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.5440414507772021,
+ "acc_stderr,none": 0.035944137112724366
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.46923076923076923,
+ "acc_stderr,none": 0.025302958890850154
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.5252100840336135,
+ "acc_stderr,none": 0.03243718055137411
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.5267175572519084,
+ "acc_stderr,none": 0.04379024936553894
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.5631067961165048,
+ "acc_stderr,none": 0.04911147107365777
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.6324786324786325,
+ "acc_stderr,none": 0.03158539157745636
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.47109826589595377,
+ "acc_stderr,none": 0.02687408588351835
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.2569832402234637,
+ "acc_stderr,none": 0.014614465821966342
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.4818181818181818,
+ "acc_stderr,none": 0.04785964010794916
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.5836734693877551,
+ "acc_stderr,none": 0.03155782816556164
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.6318407960199005,
+ "acc_stderr,none": 0.03410410565495302
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.65,
+ "acc_stderr,none": 0.047937248544110196
+ }
+ },
+ "groups": {
+ "openaimmlu": {
+ "acc,none": 0.44666001994017945,
+ "acc_stderr,none": 0.004112616445357971,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.40794701986754967,
+ "acc_stderr,none": 0.008874683686325746,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.5144124168514412,
+ "acc_stderr,none": 0.011703005860087082,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.4364463924477411,
+ "acc_stderr,none": 0.00633626561036892,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.46348143639683503,
+ "acc_stderr,none": 0.008379584468677955,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_jurisprudence",
+ "openaimmlu_prehistory",
+ "openaimmlu_world_religions",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_international_law",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_philosophy"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_security_studies",
+ "openaimmlu_public_relations",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_sociology",
+ "openaimmlu_marketing",
+ "openaimmlu_management",
+ "openaimmlu_business_ethics",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_moral_scenarios"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_nutrition",
+ "openaimmlu_professional_law",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_college_medicine",
+ "openaimmlu_human_aging",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_global_facts",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_machine_learning",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_virology",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_formal_logic",
+ "openaimmlu_anatomy"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_econometrics",
+ "openaimmlu_astronomy",
+ "openaimmlu_college_physics",
+ "openaimmlu_computer_security",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_college_biology"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu": 0,
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "0e9e39f249a16976918f6564b8830bc894c89659",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737779004.899056,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 26085.962482431,
+ "end_time": 26357.741487179,
+ "total_evaluation_time_seconds": "271.77900474799753"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-7B-Instruct-v0.3/acva_5_shot.json b/evaluations/ar/Mistral-7B-Instruct-v0.3/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..99c88de2f99e2c3df91661f8b84a5587b59b8848
--- /dev/null
+++ b/evaluations/ar/Mistral-7B-Instruct-v0.3/acva_5_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.6222732491389208,
+ "acc_stderr,none": 0.005195116511309794,
+ "acc_norm,none": 0.6025258323765786,
+ "acc_norm_stderr,none": 0.005243945200841987
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 1.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7248023552,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "e0bc86c23ce5aae1db576c8cca6f06f1f73af2db",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "112b79143",
+ "date": 1739212726.4606693,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
+ "model_name_sanitized": "mistralai__Mistral-7B-Instruct-v0.3",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1323160.590628094,
+ "end_time": 1324067.409366255,
+ "total_evaluation_time_seconds": "906.8187381608877"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-7B-Instruct-v0.3/ar_ifeval_0_shot.json b/evaluations/ar/Mistral-7B-Instruct-v0.3/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b6a5485a31dfe75a4500679acbce2973d87fb7f0
--- /dev/null
+++ b/evaluations/ar/Mistral-7B-Instruct-v0.3/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.3041044776119403,
+ "prompt_level_strict_acc_stderr,none": 0.019888706432720362,
+ "inst_level_strict_acc,none": 0.6402730375426621,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.3656716417910448,
+ "prompt_level_loose_acc_stderr,none": 0.020822161638297292,
+ "inst_level_loose_acc,none": 0.6839590443686007,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7248023552,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "e0bc86c23ce5aae1db576c8cca6f06f1f73af2db",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618660.514274,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "ar_ifeval": "b8aedf628540509f53512423803c97c0af76f913e1d9c5626e46aceefce168b2"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
+ "model_name_sanitized": "mistralai__Mistral-7B-Instruct-v0.3",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1583947.032479211,
+ "end_time": 1595212.6691982,
+ "total_evaluation_time_seconds": "11265.636718989117"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-7B-Instruct-v0.3/araMath_v3_5_shot.json b/evaluations/ar/Mistral-7B-Instruct-v0.3/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..1f3684d2011f07c5ca39fef863fb302c897d445e
--- /dev/null
+++ b/evaluations/ar/Mistral-7B-Instruct-v0.3/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.27107438016528923,
+ "acc_stderr,none": 0.01808703482553977,
+ "acc_norm,none": 0.27107438016528923,
+ "acc_norm_stderr,none": 0.01808703482553977
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7248023552,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "e0bc86c23ce5aae1db576c8cca6f06f1f73af2db",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618557.9082067,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "araMath_v3": "fc6325d1e91d814a9212e7cd3d01a2ea0128526a5ff5a12b13029293c7b85a14"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
+ "model_name_sanitized": "mistralai__Mistral-7B-Instruct-v0.3",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1583844.288660905,
+ "end_time": 1583895.209942275,
+ "total_evaluation_time_seconds": "50.921281369868666"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-7B-Instruct-v0.3/araPro_0_shot.json b/evaluations/ar/Mistral-7B-Instruct-v0.3/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ad99f0639cd37a1624af23e8089abda9be7fcf72
--- /dev/null
+++ b/evaluations/ar/Mistral-7B-Instruct-v0.3/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.4385122975404919,
+ "acc_stderr,none": 0.007017396418135006,
+ "acc_norm,none": 0.4385122975404919,
+ "acc_norm_stderr,none": 0.007017396418135006
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7248023552,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "e0bc86c23ce5aae1db576c8cca6f06f1f73af2db",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617070.2494006,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "araPro": "199097343993a3034793f07adc5e21cca4b5d4e6175f4b73353037c1f92be7cc"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
+ "model_name_sanitized": "mistralai__Mistral-7B-Instruct-v0.3",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1582356.635046546,
+ "end_time": 1582870.225454165,
+ "total_evaluation_time_seconds": "513.5904076187871"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-7B-Instruct-v0.3/arabicmmlu_0_shot.json b/evaluations/ar/Mistral-7B-Instruct-v0.3/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..5c2f4430e921de2644f58b09cd55b0bb7589efb0
--- /dev/null
+++ b/evaluations/ar/Mistral-7B-Instruct-v0.3/arabicmmlu_0_shot.json
@@ -0,0 +1,2045 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.4527153234175026,
+ "acc_stderr,none": 0.00405621139929555,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.4506615214994487,
+ "acc_stderr,none": 0.007954799407772264,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.3171052631578947,
+ "acc_stderr,none": 0.016891091712197062
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.5359281437125748,
+ "acc_stderr,none": 0.02732900254030424
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.46153846153846156,
+ "acc_stderr,none": 0.0808703820058226
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.29577464788732394,
+ "acc_stderr,none": 0.01806866065136688
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.4039408866995074,
+ "acc_stderr,none": 0.0345245390382204
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.47478991596638653,
+ "acc_stderr,none": 0.03243718055137411
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.46078431372549017,
+ "acc_stderr,none": 0.04959859966384181
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.6376376376376376,
+ "acc_stderr,none": 0.01521574574388687
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.410828025477707,
+ "acc_stderr,none": 0.027808585738331212
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.43924665856622114,
+ "acc_stderr,none": 0.011971390201420818,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.545751633986928,
+ "acc_stderr,none": 0.0201429745537952
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.410958904109589,
+ "acc_stderr,none": 0.025788216239601053
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.27692307692307694,
+ "acc_stderr,none": 0.02268804235242499
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.3333333333333333,
+ "acc_stderr,none": 0.09245003270420485
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.48412698412698413,
+ "acc_stderr,none": 0.03154381303686602
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.49476650563607083,
+ "acc_stderr,none": 0.00988842552315136,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.5631709331131296,
+ "acc_stderr,none": 0.014258807143831253
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.3761574074074074,
+ "acc_stderr,none": 0.016489858263852093
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.5465116279069767,
+ "acc_stderr,none": 0.03807016210250966
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.5493827160493827,
+ "acc_stderr,none": 0.039212856567980736
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.52,
+ "acc_stderr,none": 0.05807730170189531
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.447203196347032,
+ "acc_stderr,none": 0.008304479397188922,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.3448275862068966,
+ "acc_stderr,none": 0.05125421389342353
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.4111111111111111,
+ "acc_stderr,none": 0.025968631464617472
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.3911368015414258,
+ "acc_stderr,none": 0.015154263144018552
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.4152542372881356,
+ "acc_stderr,none": 0.03214449793774544
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.5517241379310345,
+ "acc_stderr,none": 0.05362711627041053
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.4852941176470588,
+ "acc_stderr,none": 0.03035969707904612
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.36099585062240663,
+ "acc_stderr,none": 0.031002543340279052
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.42105263157894735,
+ "acc_stderr,none": 0.06597717584505354
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.5858156028368794,
+ "acc_stderr,none": 0.018564831209206767
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.5135135135135135,
+ "acc_stderr,none": 0.05849919621886868
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.4233576642335766,
+ "acc_stderr,none": 0.04236795684728883
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.40476190476190477,
+ "acc_stderr,none": 0.0339525213962775
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.4353272784215471,
+ "acc_stderr,none": 0.008670865554441175,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.3747338537970192,
+ "acc_stderr,none": 0.012900085684381467
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.4789272030651341,
+ "acc_stderr,none": 0.030981131803166275
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.3568627450980392,
+ "acc_stderr,none": 0.030059765026712162
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.09245003270420485
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.4462809917355372,
+ "acc_stderr,none": 0.0320214054542567
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.6368421052631579,
+ "acc_stderr,none": 0.03498104083833201
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.4547677261613692,
+ "acc_stderr,none": 0.0246521904429556
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.5416666666666666,
+ "acc_stderr,none": 0.027222899101477363
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.484375,
+ "acc_stderr,none": 0.06296331249416676
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.4527153234175026,
+ "acc_stderr,none": 0.00405621139929555,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.4506615214994487,
+ "acc_stderr,none": 0.007954799407772264,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.43924665856622114,
+ "acc_stderr,none": 0.011971390201420818,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.49476650563607083,
+ "acc_stderr,none": 0.00988842552315136,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.447203196347032,
+ "acc_stderr,none": 0.008304479397188922,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.4353272784215471,
+ "acc_stderr,none": 0.008670865554441175,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_arabic_language_(general)",
+ "arabicmmlu_arabic_language_(grammar)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_primary_math"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_primary_history"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_high_geography"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_univ_management"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735752674.195445,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
+ "model_name_sanitized": "mistralai__Mistral-7B-Instruct-v0.3",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 6518.875463969,
+ "end_time": 7152.251648152,
+ "total_evaluation_time_seconds": "633.3761841830001"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-7B-Instruct-v0.3/etec_v2_0_shot.json b/evaluations/ar/Mistral-7B-Instruct-v0.3/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b57989419d6ffd1577f91e29e248ec1183d0a4c0
--- /dev/null
+++ b/evaluations/ar/Mistral-7B-Instruct-v0.3/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.3566507684154743,
+ "acc_stderr,none": 0.01102996491785656,
+ "acc_norm,none": 0.3566507684154743,
+ "acc_norm_stderr,none": 0.01102996491785656
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7248023552,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "e0bc86c23ce5aae1db576c8cca6f06f1f73af2db",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617648.4240222,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "etec_v2": "8dd4f73e94b492d082eebafc44fe527d605540255eaf869f23c7d51e4ffb37c4"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
+ "model_name_sanitized": "mistralai__Mistral-7B-Instruct-v0.3",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1582935.013239375,
+ "end_time": 1583016.72552446,
+ "total_evaluation_time_seconds": "81.71228508488275"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-7B-Instruct-v0.3/exams_ar_5_shot.json b/evaluations/ar/Mistral-7B-Instruct-v0.3/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a0fa776975a348baa84fc230ce12bb1c5e150ead
--- /dev/null
+++ b/evaluations/ar/Mistral-7B-Instruct-v0.3/exams_ar_5_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.3407821229050279,
+ "acc_stderr,none": 0.02047248187699896,
+ "acc_norm,none": 0.3407821229050279,
+ "acc_norm_stderr,none": 0.02047248187699896
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7248023552,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "e0bc86c23ce5aae1db576c8cca6f06f1f73af2db",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "112b79143",
+ "date": 1739211970.5611851,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
+ "model_name_sanitized": "mistralai__Mistral-7B-Instruct-v0.3",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1322404.630248276,
+ "end_time": 1322480.6699447,
+ "total_evaluation_time_seconds": "76.03969642403536"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-7B-Instruct-v0.3/gat_0_shot.json b/evaluations/ar/Mistral-7B-Instruct-v0.3/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..fe11b469ceb33aba7237bcc6317cc752412887a3
--- /dev/null
+++ b/evaluations/ar/Mistral-7B-Instruct-v0.3/gat_0_shot.json
@@ -0,0 +1,543 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.2664618086040386,
+ "acc_stderr,none": 0.003495353970358859,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.24935064935064935,
+ "acc_stderr,none": 0.008335372497778036
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.2983606557377049,
+ "acc_stderr,none": 0.00873445255221157
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.25874125874125875,
+ "acc_stderr,none": 0.008403358167147365
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.19138755980861244,
+ "acc_stderr,none": 0.012175219862346352
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.30573770491803276,
+ "acc_stderr,none": 0.013195760894549713
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.27603305785123966,
+ "acc_stderr,none": 0.012856618756239491
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.2561349693251534,
+ "acc_stderr,none": 0.012092310807729188
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.25205479452054796,
+ "acc_stderr,none": 0.022757873597035808
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.2729678638941399,
+ "acc_stderr,none": 0.008663668753419975
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.2664618086040386,
+ "acc_stderr,none": 0.003495353970358859,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7248023552,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "e0bc86c23ce5aae1db576c8cca6f06f1f73af2db",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "112b79143",
+ "date": 1739240499.1300695,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
+ "model_name_sanitized": "mistralai__Mistral-7B-Instruct-v0.3",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1350933.020023772,
+ "end_time": 1351628.555126437,
+ "total_evaluation_time_seconds": "695.5351026649587"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-7B-Instruct-v0.3/moe_ien_mcq_0_shot.json b/evaluations/ar/Mistral-7B-Instruct-v0.3/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..3abfb4fe17ef03fa233994b166d12a1ab5b54516
--- /dev/null
+++ b/evaluations/ar/Mistral-7B-Instruct-v0.3/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.5359359359359359,
+ "acc_stderr,none": 0.004989814518061573,
+ "acc_norm,none": 0.5359359359359359,
+ "acc_norm_stderr,none": 0.004989814518061573
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7248023552,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "e0bc86c23ce5aae1db576c8cca6f06f1f73af2db",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617801.8553765,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_mcq": "7b5b044e4260d8f2ccd928941529cc6f13c02303af5ed0b926cb22069d0a3368"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
+ "model_name_sanitized": "mistralai__Mistral-7B-Instruct-v0.3",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1583088.400197009,
+ "end_time": 1583390.481922052,
+ "total_evaluation_time_seconds": "302.08172504301183"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-7B-Instruct-v0.3/moe_ien_tf_0_shot.json b/evaluations/ar/Mistral-7B-Instruct-v0.3/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..90fca0f18d15f518e85fbc86b36b7eb4c1d20bd6
--- /dev/null
+++ b/evaluations/ar/Mistral-7B-Instruct-v0.3/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.6340374377468658,
+ "acc_stderr,none": 0.0063130565613714554,
+ "acc_norm,none": 0.6340374377468658,
+ "acc_norm_stderr,none": 0.0063130565613714554
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7248023552,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "e0bc86c23ce5aae1db576c8cca6f06f1f73af2db",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618159.7425826,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_tf": "87c1341e70cacc508279240f78ecd4d5d873569e238982ef3f15031c20f834da"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
+ "model_name_sanitized": "mistralai__Mistral-7B-Instruct-v0.3",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1583446.306152082,
+ "end_time": 1583776.933878196,
+ "total_evaluation_time_seconds": "330.62772611388937"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-7B-Instruct-v0.3/openaimmlu_0_shot.json b/evaluations/ar/Mistral-7B-Instruct-v0.3/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..eb7882dab98468842e8d054ce1c4de6eec676862
--- /dev/null
+++ b/evaluations/ar/Mistral-7B-Instruct-v0.3/openaimmlu_0_shot.json
@@ -0,0 +1,2660 @@
+{
+ "results": {
+ "openaimmlu": {
+ "acc,none": 0.3230309072781655,
+ "acc_stderr,none": 0.0039276388831554045,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.30066225165562915,
+ "acc_stderr,none": 0.008338606312023163,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.24,
+ "acc_stderr,none": 0.04292346959909284
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.28289473684210525,
+ "acc_stderr,none": 0.03665349695640767
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.3055555555555556,
+ "acc_stderr,none": 0.03852084696008534
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.23,
+ "acc_stderr,none": 0.04229525846816506
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.3,
+ "acc_stderr,none": 0.046056618647183814
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.28,
+ "acc_stderr,none": 0.04512608598542127
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.24509803921568626,
+ "acc_stderr,none": 0.04280105837364396
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.36,
+ "acc_stderr,none": 0.048241815132442176
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.3191489361702128,
+ "acc_stderr,none": 0.030472973363380052
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.32456140350877194,
+ "acc_stderr,none": 0.04404556157374767
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.2896551724137931,
+ "acc_stderr,none": 0.03780019230438015
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.31746031746031744,
+ "acc_stderr,none": 0.023973861998992086
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.3161290322580645,
+ "acc_stderr,none": 0.02645087448904276
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.3103448275862069,
+ "acc_stderr,none": 0.03255086769970103
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.44,
+ "acc_stderr,none": 0.04988876515698589
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.3,
+ "acc_stderr,none": 0.0279404571362284
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.271523178807947,
+ "acc_stderr,none": 0.03631329803969654
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.25,
+ "acc_stderr,none": 0.029531221160930918
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.36585365853658536,
+ "acc_stderr,none": 0.011300445088563829,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.3575757575757576,
+ "acc_stderr,none": 0.03742597043806586
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.29411764705882354,
+ "acc_stderr,none": 0.03198001660115071
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.4092827004219409,
+ "acc_stderr,none": 0.032007041833595914
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.4793388429752066,
+ "acc_stderr,none": 0.04560456086387235
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.4537037037037037,
+ "acc_stderr,none": 0.04812917324536823
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.3496932515337423,
+ "acc_stderr,none": 0.03746668325470021
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.3633440514469453,
+ "acc_stderr,none": 0.027316847674192714
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.36419753086419754,
+ "acc_stderr,none": 0.026774929899722327
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.28654970760233917,
+ "acc_stderr,none": 0.034678266857038266
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.3186109238031018,
+ "acc_stderr,none": 0.006039269206309317,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.26666666666666666,
+ "acc_stderr,none": 0.038201699145179055
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.3132075471698113,
+ "acc_stderr,none": 0.02854479331905533
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.2832369942196532,
+ "acc_stderr,none": 0.034355680560478746
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.2698412698412698,
+ "acc_stderr,none": 0.03970158273235173
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.36,
+ "acc_stderr,none": 0.04824181513244218
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.35858585858585856,
+ "acc_stderr,none": 0.03416903640391521
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.28256880733944956,
+ "acc_stderr,none": 0.01930424349770715
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.3632286995515695,
+ "acc_stderr,none": 0.032277904428505
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.33035714285714285,
+ "acc_stderr,none": 0.044642857142857116
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.34,
+ "acc_stderr,none": 0.04760952285695235
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.38569604086845466,
+ "acc_stderr,none": 0.017406476619212914
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.35294117647058826,
+ "acc_stderr,none": 0.027363593284684937
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.3262411347517731,
+ "acc_stderr,none": 0.02796845304356316
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.30964797913950454,
+ "acc_stderr,none": 0.01180859826250332
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.2610294117647059,
+ "acc_stderr,none": 0.026679252270103135
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.29248366013071897,
+ "acc_stderr,none": 0.01840341571010978
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.2891566265060241,
+ "acc_stderr,none": 0.03529486801511115
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.3280584297017651,
+ "acc_stderr,none": 0.008100558505292763,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.28,
+ "acc_stderr,none": 0.045126085985421296
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.32124352331606215,
+ "acc_stderr,none": 0.033699508685490674
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.3230769230769231,
+ "acc_stderr,none": 0.023710888501970555
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.2857142857142857,
+ "acc_stderr,none": 0.029344572500634342
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.2595419847328244,
+ "acc_stderr,none": 0.03844876139785271
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.4077669902912621,
+ "acc_stderr,none": 0.048657775704107696
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.4358974358974359,
+ "acc_stderr,none": 0.032485775115784
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.3468208092485549,
+ "acc_stderr,none": 0.025624723994030457
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.2424581005586592,
+ "acc_stderr,none": 0.014333522059217887
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.42727272727272725,
+ "acc_stderr,none": 0.04738198703545483
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.3877551020408163,
+ "acc_stderr,none": 0.031192230726795656
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.42786069651741293,
+ "acc_stderr,none": 0.03498541988407795
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.51,
+ "acc_stderr,none": 0.05024183937956914
+ }
+ },
+ "groups": {
+ "openaimmlu": {
+ "acc,none": 0.3230309072781655,
+ "acc_stderr,none": 0.0039276388831554045,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.30066225165562915,
+ "acc_stderr,none": 0.008338606312023163,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.36585365853658536,
+ "acc_stderr,none": 0.011300445088563829,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.3186109238031018,
+ "acc_stderr,none": 0.006039269206309317,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.3280584297017651,
+ "acc_stderr,none": 0.008100558505292763,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_philosophy",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_world_religions",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_prehistory",
+ "openaimmlu_international_law",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_logical_fallacies"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_sociology",
+ "openaimmlu_business_ethics",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_marketing",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_management",
+ "openaimmlu_public_relations",
+ "openaimmlu_security_studies"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_professional_medicine",
+ "openaimmlu_professional_law",
+ "openaimmlu_human_aging",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_nutrition",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_virology",
+ "openaimmlu_machine_learning",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_anatomy",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_college_medicine",
+ "openaimmlu_formal_logic",
+ "openaimmlu_global_facts"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_college_biology",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_astronomy",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_computer_security",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_college_physics",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_econometrics"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu": 0,
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7248023552,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "e0bc86c23ce5aae1db576c8cca6f06f1f73af2db",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "112b79143",
+ "date": 1739222548.6378462,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
+ "model_name_sanitized": "mistralai__Mistral-7B-Instruct-v0.3",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1332982.54396398,
+ "end_time": 1333764.311185857,
+ "total_evaluation_time_seconds": "781.7672218771186"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Nemo-Instruct-2407/acva_5_shot.json b/evaluations/ar/Mistral-Nemo-Instruct-2407/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f3c4b722bf78c6e3fd0699fbd1f832268e446391
--- /dev/null
+++ b/evaluations/ar/Mistral-Nemo-Instruct-2407/acva_5_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7742824339839265,
+ "acc_stderr,none": 0.004479692846303672,
+ "acc_norm,none": 0.7692307692307693,
+ "acc_norm_stderr,none": 0.004514744002858174
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Nemo-Instruct-2407,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 12247782400,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8aedd450f2583e9c67fae1929f6936b8fc5aef9c",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736969133.0360518,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Nemo-Instruct-2407",
+ "model_name_sanitized": "mistralai__Mistral-Nemo-Instruct-2407",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4322.019001477,
+ "end_time": 5037.885975796,
+ "total_evaluation_time_seconds": "715.8669743190003"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Nemo-Instruct-2407/ar_ifeval_0_shot.json b/evaluations/ar/Mistral-Nemo-Instruct-2407/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..411329acef82f8dcf2b1487e9513561eee2229df
--- /dev/null
+++ b/evaluations/ar/Mistral-Nemo-Instruct-2407/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.3582089552238806,
+ "prompt_level_strict_acc_stderr,none": 0.020729467924035978,
+ "inst_level_strict_acc,none": 0.70580204778157,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.44402985074626866,
+ "prompt_level_loose_acc_stderr,none": 0.021481021503779226,
+ "inst_level_loose_acc,none": 0.7631399317406143,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Nemo-Instruct-2407,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 12247782400,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8aedd450f2583e9c67fae1929f6936b8fc5aef9c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739619352.6594934,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 131072,
+ "task_hashes": {
+ "ar_ifeval": "f326b8a98c506486038a589a169e687707c38c2ea33f7dd1189337e8bafb199b"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Nemo-Instruct-2407",
+ "model_name_sanitized": "mistralai__Mistral-Nemo-Instruct-2407",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS][\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST]\" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST]\" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif (message.tool_calls is defined and message.tool_calls is not none) %}\n {{- \"[TOOL_CALLS][\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- message[\"content\"] + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS]{\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e4676cb56dffea7782fd3e2b577cfaf1e123537e6ef49b3ec7caa6c095c62272",
+ "start_time": 1394036.696567707,
+ "end_time": 1403362.389299741,
+ "total_evaluation_time_seconds": "9325.692732034018"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Nemo-Instruct-2407/araMath_v3_5_shot.json b/evaluations/ar/Mistral-Nemo-Instruct-2407/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..59e5c2f63d24c87f263c514eb54f6be898d1c262
--- /dev/null
+++ b/evaluations/ar/Mistral-Nemo-Instruct-2407/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.4,
+ "acc_stderr,none": 0.01993366482555282,
+ "acc_norm,none": 0.4,
+ "acc_norm_stderr,none": 0.01993366482555282
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Nemo-Instruct-2407,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 12247782400,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8aedd450f2583e9c67fae1929f6936b8fc5aef9c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739619227.1134682,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 131072,
+ "task_hashes": {
+ "araMath_v3": "3fff45213e85bf51326ed6c644cc5e49da5f0dc899148eedf05f142fb3a2e9d7"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Nemo-Instruct-2407",
+ "model_name_sanitized": "mistralai__Mistral-Nemo-Instruct-2407",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS][\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST]\" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST]\" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif (message.tool_calls is defined and message.tool_calls is not none) %}\n {{- \"[TOOL_CALLS][\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- message[\"content\"] + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS]{\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e4676cb56dffea7782fd3e2b577cfaf1e123537e6ef49b3ec7caa6c095c62272",
+ "start_time": 1393911.575720478,
+ "end_time": 1393979.700059605,
+ "total_evaluation_time_seconds": "68.1243391269818"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Nemo-Instruct-2407/araPro_0_shot.json b/evaluations/ar/Mistral-Nemo-Instruct-2407/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a7a34df395a8db7f42a315adf518115472e759f2
--- /dev/null
+++ b/evaluations/ar/Mistral-Nemo-Instruct-2407/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.5760847830433913,
+ "acc_stderr,none": 0.006988720995850974,
+ "acc_norm,none": 0.5760847830433913,
+ "acc_norm_stderr,none": 0.006988720995850974
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Nemo-Instruct-2407,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 12247782400,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8aedd450f2583e9c67fae1929f6936b8fc5aef9c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617064.9446375,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 131072,
+ "task_hashes": {
+ "araPro": "c501abfa12db371c0936f3cfe29510e3ea50fba562223331bd89379a5f2e9338"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Nemo-Instruct-2407",
+ "model_name_sanitized": "mistralai__Mistral-Nemo-Instruct-2407",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS][\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST]\" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST]\" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif (message.tool_calls is defined and message.tool_calls is not none) %}\n {{- \"[TOOL_CALLS][\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- message[\"content\"] + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS]{\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e4676cb56dffea7782fd3e2b577cfaf1e123537e6ef49b3ec7caa6c095c62272",
+ "start_time": 1391749.239940259,
+ "end_time": 1392584.762478395,
+ "total_evaluation_time_seconds": "835.5225381359924"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Nemo-Instruct-2407/arabicmmlu_0_shot.json b/evaluations/ar/Mistral-Nemo-Instruct-2407/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..661c746ac626804c02bef0c5afc3c93dd3aca0b2
--- /dev/null
+++ b/evaluations/ar/Mistral-Nemo-Instruct-2407/arabicmmlu_0_shot.json
@@ -0,0 +1,2045 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.5597371151850571,
+ "acc_stderr,none": 0.0040439126901054235,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5504410143329658,
+ "acc_stderr,none": 0.00803729411502819,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.45,
+ "acc_stderr,none": 0.018057877962865322
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.5808383233532934,
+ "acc_stderr,none": 0.027039353229234966
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.5641025641025641,
+ "acc_stderr,none": 0.08044135838502685
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.4084507042253521,
+ "acc_stderr,none": 0.01946054309035929
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.5862068965517241,
+ "acc_stderr,none": 0.034653044884067966
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.6218487394957983,
+ "acc_stderr,none": 0.031499305777849054
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.5980392156862745,
+ "acc_stderr,none": 0.048786087144669955
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.7067067067067067,
+ "acc_stderr,none": 0.014411374425367092
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.4585987261146497,
+ "acc_stderr,none": 0.028164619599608254
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5492102065613609,
+ "acc_stderr,none": 0.011990225919534903,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.6323529411764706,
+ "acc_stderr,none": 0.019506291693954857
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.5863013698630137,
+ "acc_stderr,none": 0.02581379186479425
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.358974358974359,
+ "acc_stderr,none": 0.02432173848460235
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.5925925925925926,
+ "acc_stderr,none": 0.09636202008710973
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.5833333333333334,
+ "acc_stderr,none": 0.031118303728104594
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6183574879227053,
+ "acc_stderr,none": 0.009672265032168954,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.6680429397192403,
+ "acc_stderr,none": 0.013537873730119571
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.53125,
+ "acc_stderr,none": 0.01698692283813318
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.6569767441860465,
+ "acc_stderr,none": 0.03630268317574833
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.6481481481481481,
+ "acc_stderr,none": 0.03763605762486388
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.05479966243511907
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.553082191780822,
+ "acc_stderr,none": 0.00831484343018422,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.47126436781609193,
+ "acc_stderr,none": 0.05382727149237504
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.5611111111111111,
+ "acc_stderr,none": 0.026191146099013147
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.4836223506743738,
+ "acc_stderr,none": 0.015518420714993047
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.5127118644067796,
+ "acc_stderr,none": 0.032605860881808425
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.6781609195402298,
+ "acc_stderr,none": 0.05037749206122547
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.625,
+ "acc_stderr,none": 0.029408372932278746
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5435684647302904,
+ "acc_stderr,none": 0.03215209874442138
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.06299407883487118
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.6624113475177305,
+ "acc_stderr,none": 0.01782261691155253
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.5675675675675675,
+ "acc_stderr,none": 0.05798377475143102
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.45255474452554745,
+ "acc_stderr,none": 0.04268118366696233
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.49047619047619045,
+ "acc_stderr,none": 0.034579448570031264
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5374256185405575,
+ "acc_stderr,none": 0.008583248393590412,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.44996451383960256,
+ "acc_stderr,none": 0.013258157065811954
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.5555555555555556,
+ "acc_stderr,none": 0.03081667756806828
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.4196078431372549,
+ "acc_stderr,none": 0.030964616656831884
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.8888888888888888,
+ "acc_stderr,none": 0.06163335513613659
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.6942148760330579,
+ "acc_stderr,none": 0.02967881888073462
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.6526315789473685,
+ "acc_stderr,none": 0.03463365347393426
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.5256723716381418,
+ "acc_stderr,none": 0.024721038181293356
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.7678571428571429,
+ "acc_stderr,none": 0.023067231459910752
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.640625,
+ "acc_stderr,none": 0.060451293443302384
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.5597371151850571,
+ "acc_stderr,none": 0.0040439126901054235,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5504410143329658,
+ "acc_stderr,none": 0.00803729411502819,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5492102065613609,
+ "acc_stderr,none": 0.011990225919534903,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6183574879227053,
+ "acc_stderr,none": 0.009672265032168954,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.553082191780822,
+ "acc_stderr,none": 0.00831484343018422,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5374256185405575,
+ "acc_stderr,none": 0.008583248393590412,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_arabic_language_(general)",
+ "arabicmmlu_high_arabic_language"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_high_physics",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_middle_computer_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_primary_history",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_primary_islamic_studies"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_univ_political_science"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_univ_management"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=mistralai/Mistral-Nemo-Instruct-2407,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735737831.1203127,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "mistralai/Mistral-Nemo-Instruct-2407",
+ "model_name_sanitized": "mistralai__Mistral-Nemo-Instruct-2407",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 100506.035146164,
+ "end_time": 101070.123980783,
+ "total_evaluation_time_seconds": "564.088834619004"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Nemo-Instruct-2407/etec_v2_0_shot.json b/evaluations/ar/Mistral-Nemo-Instruct-2407/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a504d39e0add147913e87de29953ea0eed3be8eb
--- /dev/null
+++ b/evaluations/ar/Mistral-Nemo-Instruct-2407/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.492845786963434,
+ "acc_stderr,none": 0.011512103852890532,
+ "acc_norm,none": 0.492845786963434,
+ "acc_norm_stderr,none": 0.011512103852890532
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Nemo-Instruct-2407,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 12247782400,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8aedd450f2583e9c67fae1929f6936b8fc5aef9c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617957.7964923,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 131072,
+ "task_hashes": {
+ "etec_v2": "e06d601415c83f4efd3319516e349cd6cfb9329222e71456a9d89dce2525be0f"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Nemo-Instruct-2407",
+ "model_name_sanitized": "mistralai__Mistral-Nemo-Instruct-2407",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS][\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST]\" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST]\" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif (message.tool_calls is defined and message.tool_calls is not none) %}\n {{- \"[TOOL_CALLS][\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- message[\"content\"] + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS]{\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e4676cb56dffea7782fd3e2b577cfaf1e123537e6ef49b3ec7caa6c095c62272",
+ "start_time": 1392642.204060316,
+ "end_time": 1392751.762366377,
+ "total_evaluation_time_seconds": "109.55830606096424"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Nemo-Instruct-2407/exams_ar_5_shot.json b/evaluations/ar/Mistral-Nemo-Instruct-2407/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..456a90bc2c440311c97da2156004491351499d9b
--- /dev/null
+++ b/evaluations/ar/Mistral-Nemo-Instruct-2407/exams_ar_5_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.4748603351955307,
+ "acc_stderr,none": 0.0215693950041748,
+ "acc_norm,none": 0.4748603351955307,
+ "acc_norm_stderr,none": 0.0215693950041748
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Nemo-Instruct-2407,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 12247782400,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8aedd450f2583e9c67fae1929f6936b8fc5aef9c",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737022373.3396137,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.86\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Nemo-Instruct-2407",
+ "model_name_sanitized": "mistralai__Mistral-Nemo-Instruct-2407",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4543.296069537,
+ "end_time": 5184.026563092,
+ "total_evaluation_time_seconds": "640.7304935550001"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Nemo-Instruct-2407/gat_0_shot.json b/evaluations/ar/Mistral-Nemo-Instruct-2407/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..353c8742d5f4fd1c81acdfe04d50e830665e63eb
--- /dev/null
+++ b/evaluations/ar/Mistral-Nemo-Instruct-2407/gat_0_shot.json
@@ -0,0 +1,539 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.2544211714536561,
+ "acc_stderr,none": 0.0034266849246390933,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.21818181818181817,
+ "acc_stderr,none": 0.007957256646112694
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.22768670309653916,
+ "acc_stderr,none": 0.008005224886568718
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.2388663967611336,
+ "acc_stderr,none": 0.008181691396125238
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.19904306220095694,
+ "acc_stderr,none": 0.012357421397385122
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.18934426229508197,
+ "acc_stderr,none": 0.011221281369022177
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.3074380165289256,
+ "acc_stderr,none": 0.01327073443676181
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.2837423312883436,
+ "acc_stderr,none": 0.012488908992810271
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.2821917808219178,
+ "acc_stderr,none": 0.02358987837397864
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.3444234404536862,
+ "acc_stderr,none": 0.009241177951937967
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.2544211714536561,
+ "acc_stderr,none": 0.0034266849246390933,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=mistralai/Mistral-Nemo-Instruct-2407,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735737160.254528,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "mistralai/Mistral-Nemo-Instruct-2407",
+ "model_name_sanitized": "mistralai__Mistral-Nemo-Instruct-2407",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 99835.105661851,
+ "end_time": 100475.795755295,
+ "total_evaluation_time_seconds": "640.6900934439909"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Nemo-Instruct-2407/moe_ien_mcq_0_shot.json b/evaluations/ar/Mistral-Nemo-Instruct-2407/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ee005e4df541fb3bebab96f40678514134f9bf1b
--- /dev/null
+++ b/evaluations/ar/Mistral-Nemo-Instruct-2407/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.6842842842842843,
+ "acc_stderr,none": 0.0046505613370222115,
+ "acc_norm,none": 0.6842842842842843,
+ "acc_norm_stderr,none": 0.0046505613370222115
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Nemo-Instruct-2407,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 12247782400,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8aedd450f2583e9c67fae1929f6936b8fc5aef9c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618126.3322697,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 131072,
+ "task_hashes": {
+ "moe_ien_mcq": "e05a3d8b5f495479981b5fde66a4e065b41dec7a24c3efeb64d267eaf3c23cbd"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Nemo-Instruct-2407",
+ "model_name_sanitized": "mistralai__Mistral-Nemo-Instruct-2407",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS][\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST]\" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST]\" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif (message.tool_calls is defined and message.tool_calls is not none) %}\n {{- \"[TOOL_CALLS][\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- message[\"content\"] + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS]{\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e4676cb56dffea7782fd3e2b577cfaf1e123537e6ef49b3ec7caa6c095c62272",
+ "start_time": 1392810.603565529,
+ "end_time": 1393281.038045333,
+ "total_evaluation_time_seconds": "470.43447980401106"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Nemo-Instruct-2407/moe_ien_tf_0_shot.json b/evaluations/ar/Mistral-Nemo-Instruct-2407/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..3490873b6d803ce7ee8f2acb9e4af6fdf7449c95
--- /dev/null
+++ b/evaluations/ar/Mistral-Nemo-Instruct-2407/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.7178430362356174,
+ "acc_stderr,none": 0.005898261619714902,
+ "acc_norm,none": 0.7178430362356174,
+ "acc_norm_stderr,none": 0.005898261619714902
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Nemo-Instruct-2407,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 12247782400,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8aedd450f2583e9c67fae1929f6936b8fc5aef9c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618656.899603,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 131072,
+ "task_hashes": {
+ "moe_ien_tf": "ff025e68710a3689e092aa2517e40514bb9f34f121dd37f9dcb54e7db60b2810"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Nemo-Instruct-2407",
+ "model_name_sanitized": "mistralai__Mistral-Nemo-Instruct-2407",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS][\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST]\" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST]\" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif (message.tool_calls is defined and message.tool_calls is not none) %}\n {{- \"[TOOL_CALLS][\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- message[\"content\"] + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS]{\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e4676cb56dffea7782fd3e2b577cfaf1e123537e6ef49b3ec7caa6c095c62272",
+ "start_time": 1393341.259869937,
+ "end_time": 1393849.086177709,
+ "total_evaluation_time_seconds": "507.8263077719603"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Nemo-Instruct-2407/openaimmlu_0_shot.json b/evaluations/ar/Mistral-Nemo-Instruct-2407/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..9d8de7de1611910ffe136035d4397ad41c93a8b2
--- /dev/null
+++ b/evaluations/ar/Mistral-Nemo-Instruct-2407/openaimmlu_0_shot.json
@@ -0,0 +1,2662 @@
+{
+ "results": {
+ "openaimmlu": {
+ "acc,none": 0.4615439396097422,
+ "acc_stderr,none": 0.004090287961453241,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.4198675496688742,
+ "acc_stderr,none": 0.008819083118680756,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.24,
+ "acc_stderr,none": 0.042923469599092816
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.5197368421052632,
+ "acc_stderr,none": 0.04065771002562603
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.4652777777777778,
+ "acc_stderr,none": 0.041711158581816184
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.37,
+ "acc_stderr,none": 0.04852365870939099
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.36,
+ "acc_stderr,none": 0.048241815132442176
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.27,
+ "acc_stderr,none": 0.044619604333847394
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.28431372549019607,
+ "acc_stderr,none": 0.04488482852329017
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.52,
+ "acc_stderr,none": 0.050211673156867795
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.4297872340425532,
+ "acc_stderr,none": 0.03236214467715564
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.3333333333333333,
+ "acc_stderr,none": 0.044346007015849245
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.5241379310344828,
+ "acc_stderr,none": 0.0416180850350153
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.3835978835978836,
+ "acc_stderr,none": 0.025043757318520196
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.5935483870967742,
+ "acc_stderr,none": 0.027941727346256308
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.43349753694581283,
+ "acc_stderr,none": 0.03486731727419872
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.57,
+ "acc_stderr,none": 0.04975698519562428
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.2962962962962963,
+ "acc_stderr,none": 0.02784081149587193
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.3443708609271523,
+ "acc_stderr,none": 0.038796870240733264
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.4444444444444444,
+ "acc_stderr,none": 0.03388857118502325
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.5720620842572062,
+ "acc_stderr,none": 0.011582619725483814,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.6606060606060606,
+ "acc_stderr,none": 0.03697442205031595
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.6176470588235294,
+ "acc_stderr,none": 0.03410785338904719
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.6624472573839663,
+ "acc_stderr,none": 0.03078154910202622
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.628099173553719,
+ "acc_stderr,none": 0.04412015806624505
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.5648148148148148,
+ "acc_stderr,none": 0.04792898170907062
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.4723926380368098,
+ "acc_stderr,none": 0.03922378290610991
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.5241157556270096,
+ "acc_stderr,none": 0.028365041542564577
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.5277777777777778,
+ "acc_stderr,none": 0.027777777777777797
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.5380116959064327,
+ "acc_stderr,none": 0.03823727092882307
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.44622387053270396,
+ "acc_stderr,none": 0.0063302986349148774,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.4444444444444444,
+ "acc_stderr,none": 0.04292596718256981
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.5094339622641509,
+ "acc_stderr,none": 0.0307673947078081
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.41040462427745666,
+ "acc_stderr,none": 0.03750757044895537
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.2619047619047619,
+ "acc_stderr,none": 0.03932537680392871
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.36,
+ "acc_stderr,none": 0.048241815132442176
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.5858585858585859,
+ "acc_stderr,none": 0.035094383488796295
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.5614678899082569,
+ "acc_stderr,none": 0.021274713073954565
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.47085201793721976,
+ "acc_stderr,none": 0.03350073248773404
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.24107142857142858,
+ "acc_stderr,none": 0.04059867246952685
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.48,
+ "acc_stderr,none": 0.050211673156867795
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.5925925925925926,
+ "acc_stderr,none": 0.017570705239256555
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.5294117647058824,
+ "acc_stderr,none": 0.02858034106513829
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.30851063829787234,
+ "acc_stderr,none": 0.027553366165101362
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.3546284224250326,
+ "acc_stderr,none": 0.012218576439090169
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.44485294117647056,
+ "acc_stderr,none": 0.03018753206032938
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.42483660130718953,
+ "acc_stderr,none": 0.01999797303545833
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.43373493975903615,
+ "acc_stderr,none": 0.03858158940685517
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.46682897139379187,
+ "acc_stderr,none": 0.008294155824875415,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.49,
+ "acc_stderr,none": 0.05024183937956912
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.6373056994818653,
+ "acc_stderr,none": 0.03469713791704371
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.4512820512820513,
+ "acc_stderr,none": 0.02523038123893484
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.44537815126050423,
+ "acc_stderr,none": 0.0322841062671639
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.5114503816793893,
+ "acc_stderr,none": 0.043841400240780176
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.5436893203883495,
+ "acc_stderr,none": 0.049318019942204146
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.6410256410256411,
+ "acc_stderr,none": 0.03142616993791924
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.4884393063583815,
+ "acc_stderr,none": 0.026911898686377913
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.24692737430167597,
+ "acc_stderr,none": 0.01442229220480885
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.5727272727272728,
+ "acc_stderr,none": 0.04738198703545483
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.5918367346938775,
+ "acc_stderr,none": 0.03146465712827424
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.7064676616915423,
+ "acc_stderr,none": 0.03220024104534205
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.67,
+ "acc_stderr,none": 0.047258156262526066
+ }
+ },
+ "groups": {
+ "openaimmlu": {
+ "acc,none": 0.4615439396097422,
+ "acc_stderr,none": 0.004090287961453241,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.4198675496688742,
+ "acc_stderr,none": 0.008819083118680756,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.5720620842572062,
+ "acc_stderr,none": 0.011582619725483814,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.44622387053270396,
+ "acc_stderr,none": 0.0063302986349148774,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.46682897139379187,
+ "acc_stderr,none": 0.008294155824875415,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_prehistory",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_philosophy",
+ "openaimmlu_international_law",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_world_religions",
+ "openaimmlu_high_school_european_history"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_marketing",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_business_ethics",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_security_studies",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_public_relations",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_management",
+ "openaimmlu_sociology",
+ "openaimmlu_human_sexuality"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_professional_law",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_nutrition",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_formal_logic",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_college_medicine",
+ "openaimmlu_human_aging",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_anatomy",
+ "openaimmlu_global_facts",
+ "openaimmlu_machine_learning",
+ "openaimmlu_virology"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_high_school_physics",
+ "openaimmlu_college_biology",
+ "openaimmlu_computer_security",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_college_physics",
+ "openaimmlu_astronomy",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_econometrics",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_conceptual_physics"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu": 0,
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Nemo-Instruct-2407,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 12247782400,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8aedd450f2583e9c67fae1929f6936b8fc5aef9c",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736969874.3072467,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Nemo-Instruct-2407",
+ "model_name_sanitized": "mistralai__Mistral-Nemo-Instruct-2407",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 5063.260085979,
+ "end_time": 5346.967923807,
+ "total_evaluation_time_seconds": "283.70783782800027"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Small-Instruct-2409/acva_5_shot.json b/evaluations/ar/Mistral-Small-Instruct-2409/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..9efe8c54e5b8ec43fd0b11e0bf8d73f2eb39fc28
--- /dev/null
+++ b/evaluations/ar/Mistral-Small-Instruct-2409/acva_5_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7159586681974741,
+ "acc_stderr,none": 0.004832263417483554,
+ "acc_norm,none": 0.6893226176808266,
+ "acc_norm_stderr,none": 0.004958861031051597
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Small-Instruct-2409,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 22247282688,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8012044390bdc1c6d8ab162f5416220f43bf517b",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736969697.6002197,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Small-Instruct-2409",
+ "model_name_sanitized": "mistralai__Mistral-Small-Instruct-2409",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 5310.719588598,
+ "end_time": 7490.179107189,
+ "total_evaluation_time_seconds": "2179.4595185910002"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Small-Instruct-2409/ar_ifeval_0_shot.json b/evaluations/ar/Mistral-Small-Instruct-2409/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a99a5dd2678ef45303a0a7ddde31d0d8fa473bc4
--- /dev/null
+++ b/evaluations/ar/Mistral-Small-Instruct-2409/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.5111940298507462,
+ "prompt_level_strict_acc_stderr,none": 0.021611466915389024,
+ "inst_level_strict_acc,none": 0.7815699658703071,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.6436567164179104,
+ "prompt_level_loose_acc_stderr,none": 0.020705444127112654,
+ "inst_level_loose_acc,none": 0.8430034129692833,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Small-Instruct-2409,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 22247282688,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8012044390bdc1c6d8ab162f5416220f43bf517b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739619509.695591,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "ar_ifeval": "04f79d36c1f856a7e0d2a4cc61bd745f1fdc633ccba1d094088f415f6471654b"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Small-Instruct-2409",
+ "model_name_sanitized": "mistralai__Mistral-Small-Instruct-2409",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1461935.69256131,
+ "end_time": 1471595.726226262,
+ "total_evaluation_time_seconds": "9660.033664952032"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Small-Instruct-2409/araMath_v3_5_shot.json b/evaluations/ar/Mistral-Small-Instruct-2409/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..616c59caf9e07b18882966223ce1d08e883770d2
--- /dev/null
+++ b/evaluations/ar/Mistral-Small-Instruct-2409/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.4446280991735537,
+ "acc_stderr,none": 0.020219570899233173,
+ "acc_norm,none": 0.4446280991735537,
+ "acc_norm_stderr,none": 0.020219570899233173
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Small-Instruct-2409,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 22247282688,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8012044390bdc1c6d8ab162f5416220f43bf517b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739619380.3911364,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "araMath_v3": "8745758588621a4626b1d9dd0d3b59d90cdd106860afa2362c8e0cd8b77bd38a"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Small-Instruct-2409",
+ "model_name_sanitized": "mistralai__Mistral-Small-Instruct-2409",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1461806.514496169,
+ "end_time": 1461868.915775248,
+ "total_evaluation_time_seconds": "62.40127907902934"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Small-Instruct-2409/araPro_0_shot.json b/evaluations/ar/Mistral-Small-Instruct-2409/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..0b59e313b8aed660d0fe64bb7247f13d94c7fa6d
--- /dev/null
+++ b/evaluations/ar/Mistral-Small-Instruct-2409/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.47730453909218157,
+ "acc_stderr,none": 0.007063779668905028,
+ "acc_norm,none": 0.47730453909218157,
+ "acc_norm_stderr,none": 0.007063779668905028
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Small-Instruct-2409,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 22247282688,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8012044390bdc1c6d8ab162f5416220f43bf517b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617068.7956502,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "araPro": "7ae4350d99b977b9fbeea4421304e875323416c6b521abf45bd0eb9782f969b5"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Small-Instruct-2409",
+ "model_name_sanitized": "mistralai__Mistral-Small-Instruct-2409",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1459495.184806751,
+ "end_time": 1460928.893959109,
+ "total_evaluation_time_seconds": "1433.7091523578856"
+}
\ No newline at end of file
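
Each evaluation file in this patch follows the standard lm-evaluation-harness JSON schema: a `results` map with per-task accuracies, a `groups` map with aggregated category scores, and a `configs` map recording the exact task configuration (including the serialized prompt-building code) used for the run. As a minimal sketch of how to read these records — the relative path below is taken from this diff, and nothing beyond the Python standard library is assumed — one could print the headline ArabicMMLU score and its category breakdown like this:

```python
import json

# Path as it appears in this patch; adjust to wherever the repo is checked out.
path = "evaluations/ar/Mistral-Small-Instruct-2409/arabicmmlu_0_shot.json"

with open(path, encoding="utf-8") as f:
    data = json.load(f)

# Headline ArabicMMLU accuracy with its standard error.
overall = data["results"]["arabicmmlu"]
print(f"arabicmmlu: {overall['acc,none']:.4f} ± {overall['acc_stderr,none']:.4f}")

# Aggregated accuracy for each top-level group (Humanities, Language, STEM, ...).
for name, scores in sorted(data.get("groups", {}).items()):
    print(f"{scores['alias']:>20}: {scores['acc,none']:.4f}")
```

The `acc,none` / `acc_stderr,none` key names follow the harness convention of suffixing each metric with the name of the output filter applied, with `none` meaning no filter.
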
diff --git a/evaluations/ar/Mistral-Small-Instruct-2409/arabicmmlu_0_shot.json b/evaluations/ar/Mistral-Small-Instruct-2409/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..57aec1a1efbd911d4e9455a56547579f5232935c
--- /dev/null
+++ b/evaluations/ar/Mistral-Small-Instruct-2409/arabicmmlu_0_shot.json
@@ -0,0 +1,2051 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.5043237634036666,
+ "acc_stderr,none": 0.004042363470895757,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5052370452039692,
+ "acc_stderr,none": 0.00790960602679391,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.3368421052631579,
+ "acc_stderr,none": 0.017155396919294835
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.6407185628742516,
+ "acc_stderr,none": 0.026292321014549997
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.48717948717948717,
+ "acc_stderr,none": 0.08108404256842
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.3317683881064163,
+ "acc_stderr,none": 0.018641062838831428
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.49261083743842365,
+ "acc_stderr,none": 0.035176035403610084
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.6134453781512605,
+ "acc_stderr,none": 0.03163145807552378
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.46078431372549017,
+ "acc_stderr,none": 0.04959859966384181
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.6926926926926927,
+ "acc_stderr,none": 0.014604660845760144
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.4681528662420382,
+ "acc_stderr,none": 0.028204284454138768
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.4775212636695018,
+ "acc_stderr,none": 0.012004811696820014,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.5980392156862745,
+ "acc_stderr,none": 0.01983517648437538
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.34794520547945207,
+ "acc_stderr,none": 0.024965874481689576
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.3641025641025641,
+ "acc_stderr,none": 0.02439667298509477
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.6296296296296297,
+ "acc_stderr,none": 0.09470524295495535
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.5317460317460317,
+ "acc_stderr,none": 0.03149604347936578
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.5628019323671497,
+ "acc_stderr,none": 0.009820739967892693,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.620148637489678,
+ "acc_stderr,none": 0.01395282207034666
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.45023148148148145,
+ "acc_stderr,none": 0.016935673216772293
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.5930232558139535,
+ "acc_stderr,none": 0.03756839173779933
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.6481481481481481,
+ "acc_stderr,none": 0.037636057624863876
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.68,
+ "acc_stderr,none": 0.05422675115236518
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.4994292237442922,
+ "acc_stderr,none": 0.008286856287550251,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.42528735632183906,
+ "acc_stderr,none": 0.05331106836455265
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.5222222222222223,
+ "acc_stderr,none": 0.026362914614329245
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.3988439306358382,
+ "acc_stderr,none": 0.015205676046200057
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.3686440677966102,
+ "acc_stderr,none": 0.0314707306823461
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.6551724137931034,
+ "acc_stderr,none": 0.05125421389342353
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.5698529411764706,
+ "acc_stderr,none": 0.030074971917302875
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.4854771784232365,
+ "acc_stderr,none": 0.03226124401232391
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.543859649122807,
+ "acc_stderr,none": 0.0665577530069649
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.6524822695035462,
+ "acc_stderr,none": 0.017946778859462872
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.5405405405405406,
+ "acc_stderr,none": 0.05832789513012364
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.48175182481751827,
+ "acc_stderr,none": 0.04284608260823147
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.4666666666666667,
+ "acc_stderr,none": 0.034508780443504965
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.47698089570936425,
+ "acc_stderr,none": 0.008646289649970346,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.38892831795599714,
+ "acc_stderr,none": 0.012992105378448731
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.49808429118773945,
+ "acc_stderr,none": 0.031008456046434162
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.3803921568627451,
+ "acc_stderr,none": 0.03046192691828629
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.5555555555555556,
+ "acc_stderr,none": 0.09745089103411436
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.5495867768595041,
+ "acc_stderr,none": 0.03204905158847432
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.7157894736842105,
+ "acc_stderr,none": 0.03280815673574656
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.5232273838630807,
+ "acc_stderr,none": 0.02472696435617918
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.6488095238095238,
+ "acc_stderr,none": 0.02607999894833243
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.5,
+ "acc_stderr,none": 0.06299407883487121
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.5043237634036666,
+ "acc_stderr,none": 0.004042363470895757,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5052370452039692,
+ "acc_stderr,none": 0.00790960602679391,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.4775212636695018,
+ "acc_stderr,none": 0.012004811696820014,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.5628019323671497,
+ "acc_stderr,none": 0.009820739967892693,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.4994292237442922,
+ "acc_stderr,none": 0.008286856287550251,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.47698089570936425,
+ "acc_stderr,none": 0.008646289649970346,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_arabic_language_(general)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_primary_math"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_primary_history"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_high_economics"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_univ_management",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_middle_general_knowledge"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Small-Instruct-2409,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 22247282688,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8012044390bdc1c6d8ab162f5416220f43bf517b",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736972751.2143774,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Small-Instruct-2409",
+ "model_name_sanitized": "mistralai__Mistral-Small-Instruct-2409",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 14232.929786561,
+ "end_time": 14765.426940165,
+ "total_evaluation_time_seconds": "532.4971536039993"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Small-Instruct-2409/etec_v2_0_shot.json b/evaluations/ar/Mistral-Small-Instruct-2409/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..054711cd3b3588dc76256f5e4b51d65215627812
--- /dev/null
+++ b/evaluations/ar/Mistral-Small-Instruct-2409/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.40964493905670374,
+ "acc_stderr,none": 0.011323732409166355,
+ "acc_norm,none": 0.40964493905670374,
+ "acc_norm_stderr,none": 0.011323732409166355
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Small-Instruct-2409,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 22247282688,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8012044390bdc1c6d8ab162f5416220f43bf517b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618555.909214,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "etec_v2": "e77e8618d461a8245f026c3013170019168ca5e9431e9d9d1c176a55cdcf1552"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Small-Instruct-2409",
+ "model_name_sanitized": "mistralai__Mistral-Small-Instruct-2409",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1460982.144801136,
+ "end_time": 1461066.334385176,
+ "total_evaluation_time_seconds": "84.18958403985016"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Small-Instruct-2409/exams_ar_5_shot.json b/evaluations/ar/Mistral-Small-Instruct-2409/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..cb773358d56fa3d70e1dacec4ba2a769d9db2a5a
--- /dev/null
+++ b/evaluations/ar/Mistral-Small-Instruct-2409/exams_ar_5_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.38733705772811916,
+ "acc_stderr,none": 0.021041317803855382,
+ "acc_norm,none": 0.38733705772811916,
+ "acc_norm_stderr,none": 0.021041317803855382
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 0.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Small-Instruct-2409,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 22247282688,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8012044390bdc1c6d8ab162f5416220f43bf517b",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736970120.592902,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Small-Instruct-2409",
+ "model_name_sanitized": "mistralai__Mistral-Small-Instruct-2409",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 11602.469319334,
+ "end_time": 12824.398025607,
+ "total_evaluation_time_seconds": "1221.928706273"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Small-Instruct-2409/gat_0_shot.json b/evaluations/ar/Mistral-Small-Instruct-2409/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..3d124a646b7862c8bba036c363c792fae3c6397a
--- /dev/null
+++ b/evaluations/ar/Mistral-Small-Instruct-2409/gat_0_shot.json
@@ -0,0 +1,543 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.28816004013545715,
+ "acc_stderr,none": 0.003569513517176158,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.2593692022263451,
+ "acc_stderr,none": 0.008444254056089201
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.26520947176684884,
+ "acc_stderr,none": 0.008427218151737142
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.27972027972027974,
+ "acc_stderr,none": 0.008612865946138122
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.27177033492822966,
+ "acc_stderr,none": 0.01376844704683984
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.24508196721311476,
+ "acc_stderr,none": 0.012319801935808129
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.2983471074380165,
+ "acc_stderr,none": 0.013158576974400435
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.25766871165644173,
+ "acc_stderr,none": 0.012115951274247083
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.2958904109589041,
+ "acc_stderr,none": 0.023924060011244693
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.3856332703213611,
+ "acc_stderr,none": 0.009466084278454174
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.28816004013545715,
+ "acc_stderr,none": 0.003569513517176158,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=False,pretrained=mistralai/Mistral-Small-Instruct-2409,trust_remote_code=True,mm=False",
+ "model_num_parameters": 22247282688,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8012044390bdc1c6d8ab162f5416220f43bf517b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1730953375.739498,
+        "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1073-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 550.90.07\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Reg file data sampling: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.46.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Small-Instruct-2409",
+ "model_name_sanitized": "mistralai__Mistral-Small-Instruct-2409",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 25487.850067782,
+ "end_time": 28449.915428973,
+ "total_evaluation_time_seconds": "2962.0653611909984"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Small-Instruct-2409/moe_ien_mcq_0_shot.json b/evaluations/ar/Mistral-Small-Instruct-2409/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..c09dcfaf594319226a662e5611f5b4e0dcb4333e
--- /dev/null
+++ b/evaluations/ar/Mistral-Small-Instruct-2409/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.6064064064064064,
+ "acc_stderr,none": 0.004888154163260656,
+ "acc_norm,none": 0.6064064064064064,
+ "acc_norm_stderr,none": 0.004888154163260656
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Small-Instruct-2409,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 22247282688,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8012044390bdc1c6d8ab162f5416220f43bf517b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618710.0175338,
+        "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_mcq": "c2a20c63c9048b05e61ad12ca87f357a5e71433c713f9a22b7d537ed6bc7421d"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Small-Instruct-2409",
+ "model_name_sanitized": "mistralai__Mistral-Small-Instruct-2409",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1461136.332656852,
+ "end_time": 1461391.40888449,
+ "total_evaluation_time_seconds": "255.07622763793916"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Small-Instruct-2409/moe_ien_tf_0_shot.json b/evaluations/ar/Mistral-Small-Instruct-2409/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..d76b0d5ab3e33714b37e11931a85120d283e7242
--- /dev/null
+++ b/evaluations/ar/Mistral-Small-Instruct-2409/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.6366134295036923,
+ "acc_stderr,none": 0.006303564979129615,
+ "acc_norm,none": 0.6366134295036923,
+ "acc_norm_stderr,none": 0.006303564979129615
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Small-Instruct-2409,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 22247282688,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8012044390bdc1c6d8ab162f5416220f43bf517b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739619032.2719598,
+        "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_tf": "7ae232d555f937b86ad5bf27c5a3ce636c0d7e695241e997cf20910ab8e3e678"
+ },
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Small-Instruct-2409",
+ "model_name_sanitized": "mistralai__Mistral-Small-Instruct-2409",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if messages[0][\"role\"] == \"system\" %}\n {%- set system_message = messages[0][\"content\"] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n{%- set user_messages = loop_messages | selectattr(\"role\", \"equalto\", \"user\") | list %}\n\n{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}\n{%- set ns = namespace() %}\n{%- set ns.index = 0 %}\n{%- for message in loop_messages %}\n {%- if not (message.role == \"tool\" or message.role == \"tool_results\" or (message.tool_calls is defined and message.tool_calls is not none)) %}\n {%- if (message[\"role\"] == \"user\") != (ns.index % 2 == 0) %}\n {{- raise_exception(\"After the optional system message, conversation roles must alternate user/assistant/user/assistant/...\") }}\n {%- endif %}\n {%- set ns.index = ns.index + 1 %}\n {%- endif %}\n{%- endfor %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n {%- if message[\"role\"] == \"user\" %}\n {%- if tools is not none and (message == user_messages[-1]) %}\n {{- \"[AVAILABLE_TOOLS] [\" }}\n {%- for tool in tools %}\n {%- set tool = tool.function %}\n {{- '{\"type\": \"function\", \"function\": {' }}\n {%- for key, val in tool.items() if key != \"return\" %}\n {%- if val is string %}\n {{- '\"' + key + '\": \"' + val + '\"' }}\n {%- else %}\n {{- '\"' + key + '\": ' + val|tojson }}\n {%- endif %}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \"}}\" }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" }}\n {%- endif %}\n {%- endfor %}\n {{- \"[/AVAILABLE_TOOLS]\" }}\n {%- endif %}\n {%- if loop.last and system_message is defined %}\n {{- \"[INST] \" + system_message + \"\\n\\n\" + message[\"content\"] + \"[/INST]\" }}\n {%- else %}\n {{- \"[INST] \" + message[\"content\"] + \"[/INST]\" }}\n {%- endif %}\n {%- elif message.tool_calls is defined and message.tool_calls is not none %}\n {{- \"[TOOL_CALLS] [\" }}\n {%- for tool_call in message.tool_calls %}\n {%- set out = tool_call.function|tojson %}\n {{- out[:-1] }}\n {%- if not tool_call.id is defined or tool_call.id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- ', \"id\": \"' + tool_call.id + '\"}' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- else %}\n {{- \"]\" + eos_token }}\n {%- endif %}\n {%- endfor %}\n {%- elif message[\"role\"] == \"assistant\" %}\n {{- \" \" + message[\"content\"]|trim + eos_token}}\n {%- elif message[\"role\"] == \"tool_results\" or message[\"role\"] == \"tool\" %}\n {%- if message.content is defined and message.content.content is defined %}\n {%- set content = message.content.content %}\n {%- else %}\n {%- set content = message.content %}\n {%- endif %}\n {{- '[TOOL_RESULTS] {\"content\": ' + content|string + \", \" }}\n {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}\n {{- raise_exception(\"Tool call IDs should be alphanumeric strings with length 9!\") }}\n {%- endif %}\n {{- '\"call_id\": \"' + message.tool_call_id + '\"}[/TOOL_RESULTS]' }}\n {%- else %}\n {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n {%- endif %}\n{%- endfor %}\n",
+ "chat_template_sha": "e16746b40344d6c5b5265988e0328a0bf7277be86f1c335156eae07e29c82826",
+ "start_time": 1461458.587731334,
+ "end_time": 1461738.022823052,
+ "total_evaluation_time_seconds": "279.4350917181"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Mistral-Small-Instruct-2409/openaimmlu_0_shot.json b/evaluations/ar/Mistral-Small-Instruct-2409/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..d8ef5e15d71f7f25a8b521a7c95690b17da5e8c9
--- /dev/null
+++ b/evaluations/ar/Mistral-Small-Instruct-2409/openaimmlu_0_shot.json
@@ -0,0 +1,2655 @@
+{
+ "results": {
+ "openaimmlu": {
+ " ": " ",
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.36258278145695366,
+ "acc_stderr,none": 0.0086843758586097,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.31,
+ "acc_stderr,none": 0.04648231987117316
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.45394736842105265,
+ "acc_stderr,none": 0.04051646342874142
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.2847222222222222,
+ "acc_stderr,none": 0.03773809990686934
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.32,
+ "acc_stderr,none": 0.046882617226215034
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.3,
+ "acc_stderr,none": 0.046056618647183814
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.28,
+ "acc_stderr,none": 0.045126085985421276
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.23529411764705882,
+ "acc_stderr,none": 0.04220773659171452
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.44,
+ "acc_stderr,none": 0.04988876515698589
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.3276595744680851,
+ "acc_stderr,none": 0.030683020843231004
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.3333333333333333,
+ "acc_stderr,none": 0.044346007015849245
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.43448275862068964,
+ "acc_stderr,none": 0.041307408795554966
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.35978835978835977,
+ "acc_stderr,none": 0.024718075944129277
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.45483870967741935,
+ "acc_stderr,none": 0.028327743091561063
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.41379310344827586,
+ "acc_stderr,none": 0.03465304488406796
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.53,
+ "acc_stderr,none": 0.05016135580465919
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.32592592592592595,
+ "acc_stderr,none": 0.028578348365473072
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.304635761589404,
+ "acc_stderr,none": 0.03757949922943343
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.32407407407407407,
+ "acc_stderr,none": 0.03191923445686185
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.46286031042128606,
+ "acc_stderr,none": 0.01162125734036281,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.5333333333333333,
+ "acc_stderr,none": 0.03895658065271846
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.46078431372549017,
+ "acc_stderr,none": 0.03498501649369527
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.569620253164557,
+ "acc_stderr,none": 0.03223017195937597
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.6363636363636364,
+ "acc_stderr,none": 0.043913262867240704
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.4074074074074074,
+ "acc_stderr,none": 0.04750077341199984
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.36809815950920244,
+ "acc_stderr,none": 0.03789213935838396
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.44694533762057875,
+ "acc_stderr,none": 0.028237769422085342
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.38580246913580246,
+ "acc_stderr,none": 0.02708540122613214
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.4269005847953216,
+ "acc_stderr,none": 0.03793620616529917
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.37306136210384355,
+ "acc_stderr,none": 0.006247720787955081,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.31851851851851853,
+ "acc_stderr,none": 0.040247784019771096
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.4226415094339623,
+ "acc_stderr,none": 0.030402331445769537
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.3179190751445087,
+ "acc_stderr,none": 0.0355068398916558
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.30158730158730157,
+ "acc_stderr,none": 0.04104947269903394
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.4,
+ "acc_stderr,none": 0.049236596391733084
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.4444444444444444,
+ "acc_stderr,none": 0.035402943770953675
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.3724770642201835,
+ "acc_stderr,none": 0.020728368457638497
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.38565022421524664,
+ "acc_stderr,none": 0.03266842214289201
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.2767857142857143,
+ "acc_stderr,none": 0.04246624336697627
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.28,
+ "acc_stderr,none": 0.04512608598542128
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.4648786717752235,
+ "acc_stderr,none": 0.01783579880629064
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.4444444444444444,
+ "acc_stderr,none": 0.028452639985088016
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.32978723404255317,
+ "acc_stderr,none": 0.028045946942042415
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.34419817470664926,
+ "acc_stderr,none": 0.012134433741002575
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.2757352941176471,
+ "acc_stderr,none": 0.027146271936625166
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.3758169934640523,
+ "acc_stderr,none": 0.019594021136577447
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.3795180722891566,
+ "acc_stderr,none": 0.03777798822748018
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.43274497869750456,
+ "acc_stderr,none": 0.008402070332370153,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.48,
+ "acc_stderr,none": 0.050211673156867795
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.41450777202072536,
+ "acc_stderr,none": 0.03555300319557673
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.3923076923076923,
+ "acc_stderr,none": 0.02475600038213095
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.3949579831932773,
+ "acc_stderr,none": 0.031753678460966245
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.48091603053435117,
+ "acc_stderr,none": 0.04382094705550988
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.44660194174757284,
+ "acc_stderr,none": 0.04922424153458933
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.6282051282051282,
+ "acc_stderr,none": 0.03166098891888078
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.4884393063583815,
+ "acc_stderr,none": 0.02691189868637792
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.2748603351955307,
+ "acc_stderr,none": 0.01493131670322051
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.5181818181818182,
+ "acc_stderr,none": 0.04785964010794916
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.5673469387755102,
+ "acc_stderr,none": 0.03171752824062664
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.6019900497512438,
+ "acc_stderr,none": 0.03461199429040013
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.59,
+ "acc_stderr,none": 0.04943110704237101
+ }
+ },
+ "groups": {
+ "openaimmlu_STEM": {
+ "acc,none": 0.36258278145695366,
+ "acc_stderr,none": 0.0086843758586097,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.46286031042128606,
+ "acc_stderr,none": 0.01162125734036281,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.37306136210384355,
+ "acc_stderr,none": 0.006247720787955081,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.43274497869750456,
+ "acc_stderr,none": 0.008402070332370153,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_international_law",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_philosophy",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_world_religions",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_prehistory"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_human_sexuality",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_business_ethics",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_security_studies",
+ "openaimmlu_sociology",
+ "openaimmlu_management",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_marketing",
+ "openaimmlu_public_relations"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_formal_logic",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_virology",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_human_aging",
+ "openaimmlu_machine_learning",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_professional_law",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_college_medicine",
+ "openaimmlu_global_facts",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_anatomy",
+ "openaimmlu_nutrition"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_college_physics",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_computer_security",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_college_biology",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_astronomy",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_econometrics",
+ "openaimmlu_college_chemistry"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=mistralai/Mistral-Small-Instruct-2409,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 22247282688,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "8012044390bdc1c6d8ab162f5416220f43bf517b",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736971899.4510105,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "</s>",
+ "2"
+ ],
+ "tokenizer_eos_token": [
+ "</s>",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "<s>",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "mistralai/Mistral-Small-Instruct-2409",
+ "model_name_sanitized": "mistralai__Mistral-Small-Instruct-2409",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 7512.813621255,
+ "end_time": 8409.889614024,
+ "total_evaluation_time_seconds": "897.0759927689996"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-14B-Instruct/acva_5_shot.json b/evaluations/ar/Qwen2.5-14B-Instruct/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f7e35ba45fec2e7f0402fcc6f4d2b584a5c06f1a
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-14B-Instruct/acva_5_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7539609644087256,
+ "acc_stderr,none": 0.004615218782337692,
+ "acc_norm,none": 0.7504018369690012,
+ "acc_norm_stderr,none": 0.004637495394808246
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-14B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 14770033664,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "cf98f3b3bbb457ad9e2bb7baf9a0125b6b88caa8",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736967158.9094276,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-14B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-14B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1905.493403773,
+ "end_time": 2957.044343774,
+ "total_evaluation_time_seconds": "1051.550940001"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-14B-Instruct/ar_ifeval_0_shot.json b/evaluations/ar/Qwen2.5-14B-Instruct/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..30dfa9dbcafdef19b9834a22ae21c7af99d929e4
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-14B-Instruct/ar_ifeval_0_shot.json
@@ -0,0 +1,140 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.6865671641791045,
+ "prompt_level_strict_acc_stderr,none": 0.020055655889994813,
+ "inst_level_strict_acc,none": 0.8675767918088737,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.7798507462686567,
+ "prompt_level_loose_acc_stderr,none": 0.017913789384648014,
+ "inst_level_loose_acc,none": 0.9078498293515358,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-14B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 14770033664,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "cf98f3b3bbb457ad9e2bb7baf9a0125b6b88caa8",
+ "batch_size": "1",
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737366501.2749803,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.9\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.3.107\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\nGPU 2: NVIDIA A100 80GB PCIe\nGPU 3: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (12 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.15.0rc2\n[pip3] open_clip_torch==2.26.1\n[pip3] optree==0.10.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.2.0a0\n[pip3] torchdata==0.7.0a0\n[pip3] torchdiffeq==0.2.4\n[pip3] torchmetrics==1.4.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.17.0a0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-14B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-14B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 1222795.529793559,
+ "end_time": 1224741.388765624,
+ "total_evaluation_time_seconds": "1945.858972064918"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-14B-Instruct/araMath_v3_5_shot.json b/evaluations/ar/Qwen2.5-14B-Instruct/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..7ee817cafcf836289ecdc3bda66f33448b88b3ac
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-14B-Instruct/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.828099173553719,
+ "acc_stderr,none": 0.015351884298423902,
+ "acc_norm,none": 0.828099173553719,
+ "acc_norm_stderr,none": 0.015351884298423902
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-14B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 14770033664,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "cf98f3b3bbb457ad9e2bb7baf9a0125b6b88caa8",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738685031.1295216,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "araMath_v3": "4afa6622c31e4fb937d7ad0da2119b52cd56b8bedea0f95cc12cc332c35e09f6"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-14B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-14B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 511945.35196303,
+ "end_time": 512044.172501626,
+ "total_evaluation_time_seconds": "98.82053859601729"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-14B-Instruct/araPro_0_shot.json b/evaluations/ar/Qwen2.5-14B-Instruct/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..7fb57947702e9dae0d84b7421a1ce35a69d416be
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-14B-Instruct/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.6910617876424715,
+ "acc_stderr,none": 0.0065344532028759,
+ "acc_norm,none": 0.6910617876424715,
+ "acc_norm_stderr,none": 0.0065344532028759
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-14B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 14770033664,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "cf98f3b3bbb457ad9e2bb7baf9a0125b6b88caa8",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738745549.856135,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "araPro": "59a5e15442970296d6c76ad4c1ea628b774166211f664b5c0f3eb594d33d6eb2"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-14B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-14B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 653962.315611087,
+ "end_time": 655012.793912456,
+ "total_evaluation_time_seconds": "1050.478301369003"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-14B-Instruct/arabicmmlu_0_shot.json b/evaluations/ar/Qwen2.5-14B-Instruct/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..77647d858060b6d7820a82d977f1b95bc42708e5
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-14B-Instruct/arabicmmlu_0_shot.json
@@ -0,0 +1,2051 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.6936008301625735,
+ "acc_stderr,none": 0.00373302587909067,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.6827453142227122,
+ "acc_stderr,none": 0.007472393741912611,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.5263157894736842,
+ "acc_stderr,none": 0.0181236958723731
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.7125748502994012,
+ "acc_stderr,none": 0.02480021874723033
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.717948717948718,
+ "acc_stderr,none": 0.07299934324587597
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.5743348982785602,
+ "acc_stderr,none": 0.01957520354642272
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.7142857142857143,
+ "acc_stderr,none": 0.0317852971064275
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.6974789915966386,
+ "acc_stderr,none": 0.029837962388291922
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.696078431372549,
+ "acc_stderr,none": 0.045766654032077636
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.8438438438438438,
+ "acc_stderr,none": 0.011490669345809187
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.697452229299363,
+ "acc_stderr,none": 0.02596462432074243
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6980558930741191,
+ "acc_stderr,none": 0.010952159128929795,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.7973856209150327,
+ "acc_stderr,none": 0.01626105528374612
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.7095890410958904,
+ "acc_stderr,none": 0.02379355080761079
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.4948717948717949,
+ "acc_stderr,none": 0.025349672906838653
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.7777777777777778,
+ "acc_stderr,none": 0.08153326507837146
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.746031746031746,
+ "acc_stderr,none": 0.027474608338697432
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.7270531400966184,
+ "acc_stderr,none": 0.008920558221864296,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.7563996696944674,
+ "acc_stderr,none": 0.012340191989229594
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.6828703703703703,
+ "acc_stderr,none": 0.01584098369286431
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.7151162790697675,
+ "acc_stderr,none": 0.0345162887625062
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.7345679012345679,
+ "acc_stderr,none": 0.034800041025035575
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.7733333333333333,
+ "acc_stderr,none": 0.04866999865182628
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6843607305936074,
+ "acc_stderr,none": 0.007708754356580086,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.47126436781609193,
+ "acc_stderr,none": 0.05382727149237504
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.6861111111111111,
+ "acc_stderr,none": 0.02449277389433383
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.6078998073217726,
+ "acc_stderr,none": 0.015160905911641495
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.6228813559322034,
+ "acc_stderr,none": 0.03161605923498462
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.8045977011494253,
+ "acc_stderr,none": 0.04275678110973871
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.7169117647058824,
+ "acc_stderr,none": 0.02736586113151381
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.6265560165975104,
+ "acc_stderr,none": 0.03122389407322075
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.8245614035087719,
+ "acc_stderr,none": 0.05082531275857955
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.8297872340425532,
+ "acc_stderr,none": 0.014164234541466977
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.7297297297297297,
+ "acc_stderr,none": 0.05197789984508372
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.635036496350365,
+ "acc_stderr,none": 0.041281418039994466
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.680952380952381,
+ "acc_stderr,none": 0.03224133248962465
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.6877544628875666,
+ "acc_stderr,none": 0.0078686460877362,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.5592618878637331,
+ "acc_stderr,none": 0.013231119391259417
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.7279693486590039,
+ "acc_stderr,none": 0.027598075188734354
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.6,
+ "acc_stderr,none": 0.030738931174713525
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.9629629629629629,
+ "acc_stderr,none": 0.037037037037037035
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.8471074380165289,
+ "acc_stderr,none": 0.0231821603389708
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.8,
+ "acc_stderr,none": 0.02909571869813228
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.823960880195599,
+ "acc_stderr,none": 0.018855055239784486
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.8720238095238095,
+ "acc_stderr,none": 0.018251827563156547
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.8125,
+ "acc_stderr,none": 0.0491747370293402
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.6936008301625735,
+ "acc_stderr,none": 0.00373302587909067,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.6827453142227122,
+ "acc_stderr,none": 0.007472393741912611,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6980558930741191,
+ "acc_stderr,none": 0.010952159128929795,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.7270531400966184,
+ "acc_stderr,none": 0.008920558221864296,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6843607305936074,
+ "acc_stderr,none": 0.007708754356580086,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.6877544628875666,
+ "acc_stderr,none": 0.0078686460877362,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_arabic_language_(general)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_high_computer_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_middle_history"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_middle_social_science"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_univ_management",
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_primary_general_knowledge"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-14B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 14770033664,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "cf98f3b3bbb457ad9e2bb7baf9a0125b6b88caa8",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736972201.2878518,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-14B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-14B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 7391.591328441,
+ "end_time": 7711.101377987,
+ "total_evaluation_time_seconds": "319.5100495460001"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-14B-Instruct/etec_v2_0_shot.json b/evaluations/ar/Qwen2.5-14B-Instruct/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..5de2e2dc28bca845bf2e12e1037e9095d6ed7ba1
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-14B-Instruct/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.7217806041335453,
+ "acc_stderr,none": 0.010318711283927943,
+ "acc_norm,none": 0.7217806041335453,
+ "acc_norm_stderr,none": 0.010318711283927943
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-14B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 14770033664,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "cf98f3b3bbb457ad9e2bb7baf9a0125b6b88caa8",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738682542.2863889,
+        "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "etec_v2": "ccf52ad4d1e05dccde272349596fb8819b25302b4afaa8ddefdc7288f9965839"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-14B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-14B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 509456.640224011,
+ "end_time": 509591.371451567,
+ "total_evaluation_time_seconds": "134.7312275560107"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-14B-Instruct/exams_ar_5_shot.json b/evaluations/ar/Qwen2.5-14B-Instruct/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..2cd48aa29b2f33774c435d435d5485b5aabe201f
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-14B-Instruct/exams_ar_5_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.5754189944134078,
+ "acc_stderr,none": 0.02134961180052154,
+ "acc_norm,none": 0.5754189944134078,
+ "acc_norm_stderr,none": 0.02134961180052154
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-14B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 14770033664,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "cf98f3b3bbb457ad9e2bb7baf9a0125b6b88caa8",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737022249.8453927,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-14B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-14B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1568.270723619,
+ "end_time": 2348.644455567,
+ "total_evaluation_time_seconds": "780.3737319480001"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-14B-Instruct/gat_0_shot.json b/evaluations/ar/Qwen2.5-14B-Instruct/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f03a018abb6e2146aa04b0217b95d9ca7266de12
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-14B-Instruct/gat_0_shot.json
@@ -0,0 +1,543 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.5169948576445503,
+ "acc_stderr,none": 0.003913114023230164,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.5053803339517625,
+ "acc_stderr,none": 0.00963265627008383
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.4240437158469945,
+ "acc_stderr,none": 0.009434263952899024
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.4762605815237394,
+ "acc_stderr,none": 0.009583299630884915
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.615311004784689,
+ "acc_stderr,none": 0.015057468843874143
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.48770491803278687,
+ "acc_stderr,none": 0.01431649836654981
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.6330578512396694,
+ "acc_stderr,none": 0.013861408073003083
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.4455521472392638,
+ "acc_stderr,none": 0.013769159018771772
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.5561643835616439,
+ "acc_stderr,none": 0.026041258579497174
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.6185255198487712,
+ "acc_stderr,none": 0.00944671538672554
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.5169948576445503,
+ "acc_stderr,none": 0.003913114023230164,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=False,pretrained=Qwen/Qwen2.5-14B-Instruct,trust_remote_code=True,mm=False",
+ "model_num_parameters": 14770033664,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "cf98f3b3bbb457ad9e2bb7baf9a0125b6b88caa8",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1730951159.8851488,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1073-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 550.90.07\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Reg file data sampling: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] 
pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.38.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-14B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-14B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 313651.3740997,
+ "end_time": 315420.113389589,
+ "total_evaluation_time_seconds": "1768.7392898889957"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-14B-Instruct/moe_ien_mcq_0_shot.json b/evaluations/ar/Qwen2.5-14B-Instruct/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..92d08b190165c78c23860f5a48b96c7eb1c9c45b
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-14B-Instruct/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.8051051051051051,
+ "acc_stderr,none": 0.003963378191295148,
+ "acc_norm,none": 0.8051051051051051,
+ "acc_norm_stderr,none": 0.003963378191295148
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-14B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 14770033664,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "cf98f3b3bbb457ad9e2bb7baf9a0125b6b88caa8",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738682853.2745113,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_mcq": "ce48b9a14bd92b18b8dc937edb46c180c4856590e207dc535b0ed1f5e8d9a7a5"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-14B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-14B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 509767.49576599,
+ "end_time": 510330.11789255,
+ "total_evaluation_time_seconds": "562.6221265600179"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-14B-Instruct/moe_ien_tf_0_shot.json b/evaluations/ar/Qwen2.5-14B-Instruct/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..00864f625e35985419665d8678d3c1a737acdc8d
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-14B-Instruct/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.7764039155074703,
+ "acc_stderr,none": 0.005460593590321656,
+ "acc_norm,none": 0.7764039155074703,
+ "acc_norm_stderr,none": 0.005460593590321656
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-14B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 14770033664,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "cf98f3b3bbb457ad9e2bb7baf9a0125b6b88caa8",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738683577.060945,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_tf": "f4ddc3d519c912c82ff8c20b8732077ac9136d725beb5ceddd9896a9640d070e"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-14B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-14B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 510491.403241704,
+ "end_time": 511110.843864396,
+ "total_evaluation_time_seconds": "619.4406226919964"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-14B-Instruct/openaimmlu_0_shot.json b/evaluations/ar/Qwen2.5-14B-Instruct/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a7dc3b91bc445041206c01f752ab3991fe7fe8f7
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-14B-Instruct/openaimmlu_0_shot.json
@@ -0,0 +1,2655 @@
+{
+ "results": {
+ "openaimmlu": {
+ " ": " ",
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.6125827814569537,
+ "acc_stderr,none": 0.008598613803694075,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.39,
+ "acc_stderr,none": 0.04902071300001975
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.756578947368421,
+ "acc_stderr,none": 0.034923496688842384
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.7013888888888888,
+ "acc_stderr,none": 0.03827052357950756
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.41,
+ "acc_stderr,none": 0.049431107042371025
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.52,
+ "acc_stderr,none": 0.050211673156867795
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.44,
+ "acc_stderr,none": 0.04988876515698589
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.4411764705882353,
+ "acc_stderr,none": 0.04940635630605659
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.62,
+ "acc_stderr,none": 0.048783173121456316
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.6936170212765957,
+ "acc_stderr,none": 0.030135906478517563
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.49122807017543857,
+ "acc_stderr,none": 0.04702880432049615
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.5241379310344828,
+ "acc_stderr,none": 0.041618085035015295
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.6904761904761905,
+ "acc_stderr,none": 0.023809523809523864
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.7677419354838709,
+ "acc_stderr,none": 0.024022256130308235
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.6009852216748769,
+ "acc_stderr,none": 0.03445487686264715
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.82,
+ "acc_stderr,none": 0.038612291966536955
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.4888888888888889,
+ "acc_stderr,none": 0.03047800981961583
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.48344370860927155,
+ "acc_stderr,none": 0.040802441856289715
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.6851851851851852,
+ "acc_stderr,none": 0.03167468706828978
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.7123059866962306,
+ "acc_stderr,none": 0.010563497467305187,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.793939393939394,
+ "acc_stderr,none": 0.03158415324047709
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.7794117647058824,
+ "acc_stderr,none": 0.02910225438967409
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.7848101265822784,
+ "acc_stderr,none": 0.02675082699467617
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.7603305785123967,
+ "acc_stderr,none": 0.03896878985070416
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.7314814814814815,
+ "acc_stderr,none": 0.042844679680521934
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.7177914110429447,
+ "acc_stderr,none": 0.03536117886664743
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.639871382636656,
+ "acc_stderr,none": 0.02726429759980402
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.6141975308641975,
+ "acc_stderr,none": 0.027085401226132143
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.7192982456140351,
+ "acc_stderr,none": 0.034462962170884265
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.6031692515171949,
+ "acc_stderr,none": 0.00615858158492755,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.48148148148148145,
+ "acc_stderr,none": 0.043163785995113245
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.6528301886792452,
+ "acc_stderr,none": 0.029300101705549652
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.6242774566473989,
+ "acc_stderr,none": 0.036928207672648664
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.5634920634920635,
+ "acc_stderr,none": 0.04435932892851466
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.5,
+ "acc_stderr,none": 0.050251890762960605
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.7424242424242424,
+ "acc_stderr,none": 0.031156269519646847
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.7889908256880734,
+ "acc_stderr,none": 0.01749392240411265
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.6502242152466368,
+ "acc_stderr,none": 0.03200736719484503
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.5267857142857143,
+ "acc_stderr,none": 0.04738975119274155
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.67,
+ "acc_stderr,none": 0.04725815626252609
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.7509578544061303,
+ "acc_stderr,none": 0.015464676163395976
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.6862745098039216,
+ "acc_stderr,none": 0.026568921015457155
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.4716312056737589,
+ "acc_stderr,none": 0.029779450957303055
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.455019556714472,
+ "acc_stderr,none": 0.012718456618701773
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.6433823529411765,
+ "acc_stderr,none": 0.02909720956841196
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.619281045751634,
+ "acc_stderr,none": 0.01964380155792481
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.4819277108433735,
+ "acc_stderr,none": 0.038899512528272166
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.6835057821059038,
+ "acc_stderr,none": 0.007900267253552388,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.73,
+ "acc_stderr,none": 0.044619604333847394
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.8497409326424871,
+ "acc_stderr,none": 0.025787723180723882
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.7384615384615385,
+ "acc_stderr,none": 0.0222821412042044
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.7941176470588235,
+ "acc_stderr,none": 0.02626502460827588
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.7175572519083969,
+ "acc_stderr,none": 0.03948406125768362
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.6990291262135923,
+ "acc_stderr,none": 0.04541609446503948
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.782051282051282,
+ "acc_stderr,none": 0.027046857630716677
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.6271676300578035,
+ "acc_stderr,none": 0.02603389061357627
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.5251396648044693,
+ "acc_stderr,none": 0.01670135084268263
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.7090909090909091,
+ "acc_stderr,none": 0.04350271442923243
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.7673469387755102,
+ "acc_stderr,none": 0.02704925791589618
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.746268656716418,
+ "acc_stderr,none": 0.030769444967296024
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.8,
+ "acc_stderr,none": 0.04020151261036846
+ }
+ },
+ "groups": {
+ "openaimmlu_STEM": {
+ "acc,none": 0.6125827814569537,
+ "acc_stderr,none": 0.008598613803694075,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.7123059866962306,
+ "acc_stderr,none": 0.010563497467305187,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.6031692515171949,
+ "acc_stderr,none": 0.00615858158492755,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.6835057821059038,
+ "acc_stderr,none": 0.007900267253552388,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_jurisprudence",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_prehistory",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_philosophy",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_world_religions",
+ "openaimmlu_international_law"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_human_sexuality",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_business_ethics",
+ "openaimmlu_sociology",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_marketing",
+ "openaimmlu_public_relations",
+ "openaimmlu_security_studies",
+ "openaimmlu_management",
+ "openaimmlu_us_foreign_policy"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_professional_law",
+ "openaimmlu_machine_learning",
+ "openaimmlu_human_aging",
+ "openaimmlu_virology",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_anatomy",
+ "openaimmlu_global_facts",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_formal_logic",
+ "openaimmlu_college_medicine",
+ "openaimmlu_nutrition"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_college_biology",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_college_physics",
+ "openaimmlu_econometrics",
+ "openaimmlu_astronomy",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_computer_security",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_electrical_engineering"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-14B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 14770033664,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "cf98f3b3bbb457ad9e2bb7baf9a0125b6b88caa8",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736968234.9414365,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-14B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-14B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 2981.768562537,
+ "end_time": 3362.632727306,
+ "total_evaluation_time_seconds": "380.8641647690001"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-72B-Instruct/acva_5_shot.json b/evaluations/ar/Qwen2.5-72B-Instruct/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..bd1e6eaefab3d3cdc1b236682314ef2bc85c30fe
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-72B-Instruct/acva_5_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.8026406429391504,
+ "acc_stderr,none": 0.004264865005473752,
+ "acc_norm,none": 0.7991963260619978,
+ "acc_norm_stderr,none": 0.004292679074358457
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-72B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 72706203648,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "495f39366efef23836d0cfae4fbe635880d2be31",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1736963271.2776558,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.9\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.3.107\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\nGPU 2: NVIDIA A100 80GB PCIe\nGPU 3: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (12 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.15.0rc2\n[pip3] open_clip_torch==2.26.1\n[pip3] optree==0.10.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.2.0a0\n[pip3] torchdata==0.7.0a0\n[pip3] 
torchdiffeq==0.2.4\n[pip3] torchmetrics==1.4.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.17.0a0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.44.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-72B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-72B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 819568.675452923,
+ "end_time": 821040.258353575,
+ "total_evaluation_time_seconds": "1471.5829006519634"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-72B-Instruct/ar_ifeval_0_shot.json b/evaluations/ar/Qwen2.5-72B-Instruct/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6eaf0278652329257116143bb5f309c976de823e
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-72B-Instruct/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.6772388059701493,
+ "prompt_level_strict_acc_stderr,none": 0.020213181858791902,
+ "inst_level_strict_acc,none": 0.875085324232082,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.746268656716418,
+ "prompt_level_loose_acc_stderr,none": 0.018812987595772077,
+ "inst_level_loose_acc,none": 0.9023890784982935,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-72B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 72706203648,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "495f39366efef23836d0cfae4fbe635880d2be31",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738755395.0744658,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "ar_ifeval": "d4df1727ff0f9895d83ccd0ac83f6b2c0cda091a0973481d411dffc518eff10c"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-72B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-72B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 719209.401361993,
+ "end_time": 730674.529977953,
+ "total_evaluation_time_seconds": "11465.128615959897"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-72B-Instruct/araMath_v3_5_shot.json b/evaluations/ar/Qwen2.5-72B-Instruct/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..786e8eeced79bdbd7e68805ff416108950336a7d
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-72B-Instruct/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.9289256198347108,
+ "acc_stderr,none": 0.010455108438744632,
+ "acc_norm,none": 0.9289256198347108,
+ "acc_norm_stderr,none": 0.010455108438744632
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-72B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 72706203648,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "495f39366efef23836d0cfae4fbe635880d2be31",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738750714.1959553,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "araMath_v3": "32a7b7c1c88d99ade511d812d9cbb111908e832b777672ce1804c2e7268cd3f1"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-72B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-72B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 714528.576624467,
+ "end_time": 714696.485377223,
+ "total_evaluation_time_seconds": "167.90875275596045"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-72B-Instruct/araPro_0_shot.json b/evaluations/ar/Qwen2.5-72B-Instruct/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..d43f30a5702d24b9cd7a33050bd6ce139a9dc2a0
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-72B-Instruct/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.7468506298740252,
+ "acc_stderr,none": 0.006149223797046572,
+ "acc_norm,none": 0.7468506298740252,
+ "acc_norm_stderr,none": 0.006149223797046572
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-72B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 72706203648,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "495f39366efef23836d0cfae4fbe635880d2be31",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738742689.16284,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "araPro": "fbed9ef589c990a17dcead3fd4bf430d227ad1dbfc5eb985b0069893d506f012"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-72B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-72B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 706503.486828664,
+ "end_time": 710686.185591246,
+ "total_evaluation_time_seconds": "4182.698762582033"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-72B-Instruct/arabicmmlu_0_shot.json b/evaluations/ar/Qwen2.5-72B-Instruct/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..dc112b461e61a0729b56b91143f62cf7ae195175
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-72B-Instruct/arabicmmlu_0_shot.json
@@ -0,0 +1,2049 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.7409892770667589,
+ "acc_stderr,none": 0.0035584337132555425,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.7513781697905182,
+ "acc_stderr,none": 0.006946939990015845,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.5723684210526315,
+ "acc_stderr,none": 0.01795774617649966
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.7634730538922155,
+ "acc_stderr,none": 0.023287080919597573
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.07647191129018724
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.7104851330203443,
+ "acc_stderr,none": 0.01795571043620009
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.7783251231527094,
+ "acc_stderr,none": 0.029225575892489614
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.7899159663865546,
+ "acc_stderr,none": 0.026461398717471874
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.7843137254901961,
+ "acc_stderr,none": 0.040925639582376536
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.8828828828828829,
+ "acc_stderr,none": 0.010178797267994774
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.7898089171974523,
+ "acc_stderr,none": 0.02303010888763848
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.7247873633049817,
+ "acc_stderr,none": 0.010540987217286251,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.8333333333333334,
+ "acc_stderr,none": 0.015076937921915374
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.7315068493150685,
+ "acc_stderr,none": 0.023228711080516603
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.5,
+ "acc_stderr,none": 0.02535100632816969
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.7777777777777778,
+ "acc_stderr,none": 0.08153326507837146
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.7936507936507936,
+ "acc_stderr,none": 0.025543433160843253
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.7689210950080515,
+ "acc_stderr,none": 0.008435750027106902,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.7952105697770437,
+ "acc_stderr,none": 0.011601179745220788
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.7222222222222222,
+ "acc_stderr,none": 0.015246802523694777
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.813953488372093,
+ "acc_stderr,none": 0.02975860061821377
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.7592592592592593,
+ "acc_stderr,none": 0.033694336336687475
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.8,
+ "acc_stderr,none": 0.04649905549752767
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.7240296803652968,
+ "acc_stderr,none": 0.00744343051257476,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.5747126436781609,
+ "acc_stderr,none": 0.05331106836455264
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.7694444444444445,
+ "acc_stderr,none": 0.02222947498481115
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.6628131021194605,
+ "acc_stderr,none": 0.014680522384815578
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.673728813559322,
+ "acc_stderr,none": 0.030584260959928
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.8160919540229885,
+ "acc_stderr,none": 0.04177540678018988
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.7573529411764706,
+ "acc_stderr,none": 0.026040662474201275
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5975103734439834,
+ "acc_stderr,none": 0.0316551553904741
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.8245614035087719,
+ "acc_stderr,none": 0.050825312758579544
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.8382978723404255,
+ "acc_stderr,none": 0.013876205392457564
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.7702702702702703,
+ "acc_stderr,none": 0.049234410091889724
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.6861313868613139,
+ "acc_stderr,none": 0.03979313298217895
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.7285714285714285,
+ "acc_stderr,none": 0.030760309824226048
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.7344190416536173,
+ "acc_stderr,none": 0.0076086967097943985,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.6323633782824698,
+ "acc_stderr,none": 0.012849653340567811
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.7969348659003831,
+ "acc_stderr,none": 0.02494838405532525
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.6862745098039216,
+ "acc_stderr,none": 0.029114341988755666
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.8888888888888888,
+ "acc_stderr,none": 0.061633355136136575
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.871900826446281,
+ "acc_stderr,none": 0.021527727492467282
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.8105263157894737,
+ "acc_stderr,none": 0.028505397911003327
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.8068459657701712,
+ "acc_stderr,none": 0.01954416525001844
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.9017857142857143,
+ "acc_stderr,none": 0.016259852562706387
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.765625,
+ "acc_stderr,none": 0.053369535239372906
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.7409892770667589,
+ "acc_stderr,none": 0.0035584337132555425,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.7513781697905182,
+ "acc_stderr,none": 0.006946939990015845,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.7247873633049817,
+ "acc_stderr,none": 0.010540987217286251,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.7689210950080515,
+ "acc_stderr,none": 0.008435750027106902,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.7240296803652968,
+ "acc_stderr,none": 0.00744343051257476,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.7344190416536173,
+ "acc_stderr,none": 0.0076086967097943985,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_arabic_language_(general)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_high_biology"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_middle_history",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_middle_islamic_studies"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_primary_social_science"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_univ_management"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-72B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 72706203648,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "d3d951150c1e5848237cd6a7ad11df4836aee842",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1736538564.4503984,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.9\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.3.107\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\nGPU 2: NVIDIA A100 80GB PCIe\nGPU 3: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (12 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.15.0rc2\n[pip3] open_clip_torch==2.26.1\n[pip3] optree==0.10.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.2.0a0\n[pip3] torchdata==0.7.0a0\n[pip3] 
torchdiffeq==0.2.4\n[pip3] torchmetrics==1.4.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.17.0a0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.44.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-72B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-72B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 394861.850837854,
+ "end_time": 396260.981502118,
+ "total_evaluation_time_seconds": "1399.1306642639684"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-72B-Instruct/etec_v2_0_shot.json b/evaluations/ar/Qwen2.5-72B-Instruct/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a29369ed0ba4319ea9d26dda99e60c9d9901b9e0
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-72B-Instruct/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.7869634340222575,
+ "acc_stderr,none": 0.009428302519872343,
+ "acc_norm,none": 0.7869634340222575,
+ "acc_norm_stderr,none": 0.009428302519872343
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-72B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 72706203648,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "495f39366efef23836d0cfae4fbe635880d2be31",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738747141.777552,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "etec_v2": "9ee0561eac1b05912d0cfd3a411a4bd9fa40bebbe91a6dc8ae910b4b313ac82e"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-72B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-72B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 710956.08805107,
+ "end_time": 711211.783650537,
+ "total_evaluation_time_seconds": "255.69559946702793"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-72B-Instruct/exams_ar_5_shot.json b/evaluations/ar/Qwen2.5-72B-Instruct/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..636e93818139f635c49ff6aba7c8ab83cf731b70
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-72B-Instruct/exams_ar_5_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.6070763500931099,
+ "acc_stderr,none": 0.021095671164618357,
+ "acc_norm,none": 0.6070763500931099,
+ "acc_norm_stderr,none": 0.021095671164618357
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 0.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-72B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 72706203648,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "495f39366efef23836d0cfae4fbe635880d2be31",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 4
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1736963084.4694233,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.9\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.3.107\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\nGPU 2: NVIDIA A100 80GB PCIe\nGPU 3: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (12 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.15.0rc2\n[pip3] open_clip_torch==2.26.1\n[pip3] optree==0.10.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.2.0a0\n[pip3] torchdata==0.7.0a0\n[pip3] torchdiffeq==0.2.4\n[pip3] torchmetrics==1.4.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.17.0a0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.44.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-72B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-72B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 819381.849324542,
+ "end_time": 819556.63667564,
+ "total_evaluation_time_seconds": "174.787351098028"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-72B-Instruct/gat_0_shot.json b/evaluations/ar/Qwen2.5-72B-Instruct/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..893653c72a8911e15a97aa126b6b92d6fd540302
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-72B-Instruct/gat_0_shot.json
@@ -0,0 +1,543 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.5953844224256867,
+ "acc_stderr,none": 0.0038311989919646993,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.6289424860853432,
+ "acc_stderr,none": 0.009307376581390225
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.43460837887067394,
+ "acc_stderr,none": 0.00946306183627077
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.5885167464114832,
+ "acc_stderr,none": 0.009442578683608647
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.6526315789473685,
+ "acc_stderr,none": 0.014735977850381382
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.5663934426229508,
+ "acc_stderr,none": 0.014194012266806359
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.6966942148760331,
+ "acc_stderr,none": 0.013220512730306236
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.5789877300613497,
+ "acc_stderr,none": 0.013677598428520711
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.6958904109589041,
+ "acc_stderr,none": 0.024112086414249192
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.6737240075614367,
+ "acc_stderr,none": 0.009118068403217263
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.5953844224256867,
+ "acc_stderr,none": 0.0038311989919646993,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=Qwen/Qwen2.5-72B-Instruct,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 72706203648,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "d3d951150c1e5848237cd6a7ad11df4836aee842",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1731688096.058723,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.46.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-72B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-72B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 167955.311820138,
+ "end_time": 174625.942128826,
+ "total_evaluation_time_seconds": "6670.630308687978"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-72B-Instruct/moe_ien_mcq_0_shot.json b/evaluations/ar/Qwen2.5-72B-Instruct/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..0bbfd9f889b6566aa7ce3ef300f737fe6235e6f2
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-72B-Instruct/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.8687687687687687,
+ "acc_stderr,none": 0.0033783893179881157,
+ "acc_norm,none": 0.8687687687687687,
+ "acc_norm_stderr,none": 0.0033783893179881157
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-72B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 72706203648,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "495f39366efef23836d0cfae4fbe635880d2be31",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738747465.0194297,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_mcq": "71a5a06fce67b4990c903f05d6bf809044730e558d91137c54ee0d4a18b7cbb0"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-72B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-72B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 711279.474367515,
+ "end_time": 712232.826658995,
+ "total_evaluation_time_seconds": "953.352291480056"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-72B-Instruct/moe_ien_tf_0_shot.json b/evaluations/ar/Qwen2.5-72B-Instruct/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..15fbd7f891b0ba26436d273ed5c39d0f44544247
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-72B-Instruct/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.8662201614288167,
+ "acc_stderr,none": 0.004461422745834223,
+ "acc_norm,none": 0.8662201614288167,
+ "acc_norm_stderr,none": 0.004461422745834223
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-72B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 72706203648,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "495f39366efef23836d0cfae4fbe635880d2be31",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738748483.6156833,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_tf": "9cbb5e9c3c682994cd0172a65cc8a5452d2f55e936528a4ce347bbc1dbb57fe8"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-72B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-72B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 712298.076327804,
+ "end_time": 713334.635480347,
+ "total_evaluation_time_seconds": "1036.5591525429627"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-72B-Instruct/openaimmlu_0_shot.json b/evaluations/ar/Qwen2.5-72B-Instruct/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a665cf97e8a2725f7bb25cf300d36b25209874d1
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-72B-Instruct/openaimmlu_0_shot.json
@@ -0,0 +1,2653 @@
+{
+ "results": {
+ "openaimmlu": {
+ " ": " ",
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.7248344370860927,
+ "acc_stderr,none": 0.00790772330279595,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.56,
+ "acc_stderr,none": 0.049888765156985884
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.875,
+ "acc_stderr,none": 0.026913523521537846
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.8888888888888888,
+ "acc_stderr,none": 0.026280550932848073
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.59,
+ "acc_stderr,none": 0.04943110704237101
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.64,
+ "acc_stderr,none": 0.048241815132442176
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.54,
+ "acc_stderr,none": 0.05009082659620333
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.5882352941176471,
+ "acc_stderr,none": 0.048971049527263666
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.77,
+ "acc_stderr,none": 0.04229525846816506
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.7829787234042553,
+ "acc_stderr,none": 0.026947483121496234
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.6929824561403509,
+ "acc_stderr,none": 0.04339138322579862
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.6827586206896552,
+ "acc_stderr,none": 0.038783523721386215
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.7301587301587301,
+ "acc_stderr,none": 0.022860838309232072
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.8806451612903226,
+ "acc_stderr,none": 0.018443411325315403
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.7044334975369458,
+ "acc_stderr,none": 0.032104944337514575
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.85,
+ "acc_stderr,none": 0.03588702812826369
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.5888888888888889,
+ "acc_stderr,none": 0.02999992350870668
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.6225165562913907,
+ "acc_stderr,none": 0.0395802723112157
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.7685185185185185,
+ "acc_stderr,none": 0.028765111718046948
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.8276053215077606,
+ "acc_stderr,none": 0.008832654533380828,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.8424242424242424,
+ "acc_stderr,none": 0.028450388805284343
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.8921568627450981,
+ "acc_stderr,none": 0.02177052228136839
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.869198312236287,
+ "acc_stderr,none": 0.021948766059470767
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.859504132231405,
+ "acc_stderr,none": 0.031722334260021585
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.8148148148148148,
+ "acc_stderr,none": 0.03755265865037183
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.7852760736196319,
+ "acc_stderr,none": 0.03226219377286774
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.7363344051446945,
+ "acc_stderr,none": 0.02502553850053234
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.8611111111111112,
+ "acc_stderr,none": 0.019242526226544553
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.8070175438596491,
+ "acc_stderr,none": 0.030267457554898458
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.7144302090357384,
+ "acc_stderr,none": 0.0056155230824463725,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.6222222222222222,
+ "acc_stderr,none": 0.04188307537595853
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.7660377358490567,
+ "acc_stderr,none": 0.02605529690115292
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.6705202312138728,
+ "acc_stderr,none": 0.03583901754736411
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.6428571428571429,
+ "acc_stderr,none": 0.042857142857142816
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.55,
+ "acc_stderr,none": 0.049999999999999996
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.8585858585858586,
+ "acc_stderr,none": 0.024825909793343343
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.8954128440366973,
+ "acc_stderr,none": 0.013120530245265606
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.7309417040358744,
+ "acc_stderr,none": 0.029763779406874972
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.6607142857142857,
+ "acc_stderr,none": 0.0449394906861354
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.79,
+ "acc_stderr,none": 0.040936018074033256
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.8607918263090677,
+ "acc_stderr,none": 0.01237878610188513
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.8300653594771242,
+ "acc_stderr,none": 0.021505383121231354
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.5709219858156028,
+ "acc_stderr,none": 0.02952591430255856
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.5541069100391134,
+ "acc_stderr,none": 0.012695244711379774
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.8345588235294118,
+ "acc_stderr,none": 0.02257177102549475
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.761437908496732,
+ "acc_stderr,none": 0.017242385828779603
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.5602409638554217,
+ "acc_stderr,none": 0.03864139923699121
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.7343274497869751,
+ "acc_stderr,none": 0.007406426245646063,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.75,
+ "acc_stderr,none": 0.04351941398892446
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.8911917098445595,
+ "acc_stderr,none": 0.022473253332768752
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.7923076923076923,
+ "acc_stderr,none": 0.020567539567246797
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.8865546218487395,
+ "acc_stderr,none": 0.020600225750204825
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.8320610687022901,
+ "acc_stderr,none": 0.032785485373431386
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.8155339805825242,
+ "acc_stderr,none": 0.03840423627288276
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.8290598290598291,
+ "acc_stderr,none": 0.024662496845209814
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.7514450867052023,
+ "acc_stderr,none": 0.023267528432100174
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.5441340782122905,
+ "acc_stderr,none": 0.016657229424586303
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.7090909090909091,
+ "acc_stderr,none": 0.04350271442923243
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.7510204081632653,
+ "acc_stderr,none": 0.027682979522960234
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.8407960199004975,
+ "acc_stderr,none": 0.025870646766169146
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.81,
+ "acc_stderr,none": 0.039427724440366234
+ }
+ },
+ "groups": {
+ "openaimmlu_STEM": {
+ "acc,none": 0.7248344370860927,
+ "acc_stderr,none": 0.00790772330279595,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.8276053215077606,
+ "acc_stderr,none": 0.008832654533380828,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.7144302090357384,
+ "acc_stderr,none": 0.0056155230824463725,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.7343274497869751,
+ "acc_stderr,none": 0.007406426245646063,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_international_law",
+ "openaimmlu_philosophy",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_world_religions",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_prehistory",
+ "openaimmlu_jurisprudence"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_marketing",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_public_relations",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_security_studies",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_management",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_sociology",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_business_ethics"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_professional_medicine",
+ "openaimmlu_global_facts",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_human_aging",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_machine_learning",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_anatomy",
+ "openaimmlu_nutrition",
+ "openaimmlu_formal_logic",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_professional_law",
+ "openaimmlu_virology",
+ "openaimmlu_college_medicine",
+ "openaimmlu_clinical_knowledge"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_computer_security",
+ "openaimmlu_astronomy",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_college_physics",
+ "openaimmlu_econometrics",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_college_biology",
+ "openaimmlu_elementary_mathematics"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=Qwen/Qwen2.5-72B-Instruct,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 72706203648,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "d3d951150c1e5848237cd6a7ad11df4836aee842",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1731688102.6369689,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] 
torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.46.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-72B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-72B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 167961.887782116,
+ "end_time": 174860.307504835,
+ "total_evaluation_time_seconds": "6898.4197227189725"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-7B-Instruct/acva_5_shot.json b/evaluations/ar/Qwen2.5-7B-Instruct/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..d2261928a6e7b2e565eb70adeae4c4d6f8e78dd5
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-7B-Instruct/acva_5_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7839265212399541,
+ "acc_stderr,none": 0.004410159183412007,
+ "acc_norm,none": 0.7817451205510907,
+ "acc_norm_stderr,none": 0.004426193797299392
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7615616512,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "a09a35458c702b33eeacc393d103063234e8bc28",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736969210.259454,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.86\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-7B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 3831.319873887,
+ "end_time": 4381.143410904,
+ "total_evaluation_time_seconds": "549.823537017"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-7B-Instruct/ar_ifeval_0_shot.json b/evaluations/ar/Qwen2.5-7B-Instruct/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..981b641e8a41c4258ea2a3ed53af6aa2e369fd11
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-7B-Instruct/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.28171641791044777,
+ "prompt_level_strict_acc_stderr,none": 0.019448099048933045,
+ "inst_level_strict_acc,none": 0.6518771331058021,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.332089552238806,
+ "prompt_level_loose_acc_stderr,none": 0.020361503053631682,
+ "inst_level_loose_acc,none": 0.6805460750853243,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7615616512,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "a09a35458c702b33eeacc393d103063234e8bc28",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738576311.7497714,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "ar_ifeval": "e94d7ab29bcea6c517c784b2aa65ffd558e3b4c84901ed8e147df9bd1f71c35c"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-7B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 1344856.801255893,
+ "end_time": 1348853.74844184,
+ "total_evaluation_time_seconds": "3996.947185947094"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-7B-Instruct/araMath_v3_5_shot.json b/evaluations/ar/Qwen2.5-7B-Instruct/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..5cc000a6b5892dde5b89daf6337d006429a7f49b
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-7B-Instruct/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.7173553719008264,
+ "acc_stderr,none": 0.01832183956763465,
+ "acc_norm,none": 0.7173553719008264,
+ "acc_norm_stderr,none": 0.01832183956763465
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7615616512,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "a09a35458c702b33eeacc393d103063234e8bc28",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738675616.0209072,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "araMath_v3": "4afa6622c31e4fb937d7ad0da2119b52cd56b8bedea0f95cc12cc332c35e09f6"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-7B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 674736.742411863,
+ "end_time": 674788.659606429,
+ "total_evaluation_time_seconds": "51.917194566107355"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-7B-Instruct/araPro_0_shot.json b/evaluations/ar/Qwen2.5-7B-Instruct/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..c523654b22726e3d762a192125f86ab88a473b0b
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-7B-Instruct/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.6462707458508299,
+ "acc_stderr,none": 0.006761728608991266,
+ "acc_norm,none": 0.6462707458508299,
+ "acc_norm_stderr,none": 0.006761728608991266
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7615616512,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "a09a35458c702b33eeacc393d103063234e8bc28",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738745497.0234828,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "araPro": "59a5e15442970296d6c76ad4c1ea628b774166211f664b5c0f3eb594d33d6eb2"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-7B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 1109114.192296206,
+ "end_time": 1109669.826812652,
+ "total_evaluation_time_seconds": "555.6345164459199"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-7B-Instruct/arabicmmlu_0_shot.json b/evaluations/ar/Qwen2.5-7B-Instruct/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..293fb5edf160f5c570934fa481fa29bb32d1cf44
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-7B-Instruct/arabicmmlu_0_shot.json
@@ -0,0 +1,2049 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.6154271878242823,
+ "acc_stderr,none": 0.003934302947200145,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.6063947078280044,
+ "acc_stderr,none": 0.007795174544734088,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.44473684210526315,
+ "acc_stderr,none": 0.01803765580252778
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.6616766467065869,
+ "acc_stderr,none": 0.02592786608977119
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.6153846153846154,
+ "acc_stderr,none": 0.07892141169885801
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.43661971830985913,
+ "acc_stderr,none": 0.019635508583285048
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.6748768472906403,
+ "acc_stderr,none": 0.032957975663112704
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.680672268907563,
+ "acc_stderr,none": 0.0302839955258844
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.5588235294117647,
+ "acc_stderr,none": 0.04940635630605659
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.7497497497497497,
+ "acc_stderr,none": 0.0137113480237793
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.7420382165605095,
+ "acc_stderr,none": 0.024729688908190262
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6233292831105711,
+ "acc_stderr,none": 0.011465056502784907,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.7320261437908496,
+ "acc_stderr,none": 0.017917974069594722
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.6931506849315069,
+ "acc_stderr,none": 0.02417273080537769
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.38461538461538464,
+ "acc_stderr,none": 0.024666744915187208
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.7037037037037037,
+ "acc_stderr,none": 0.0895511888632576
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.6190476190476191,
+ "acc_stderr,none": 0.030652119793011915
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.643719806763285,
+ "acc_stderr,none": 0.0095709414757183,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.6688687035507844,
+ "acc_stderr,none": 0.01352937914199443
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.5810185185185185,
+ "acc_stderr,none": 0.01679527052480067
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.686046511627907,
+ "acc_stderr,none": 0.03549043982227172
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.7098765432098766,
+ "acc_stderr,none": 0.035765960830111604
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.72,
+ "acc_stderr,none": 0.052195060344100805
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6098744292237442,
+ "acc_stderr,none": 0.00810834354787168,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.45977011494252873,
+ "acc_stderr,none": 0.053741581963657706
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.6527777777777778,
+ "acc_stderr,none": 0.02512691742803579
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.5144508670520231,
+ "acc_stderr,none": 0.01552026616876521
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.5466101694915254,
+ "acc_stderr,none": 0.032474375633194844
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.7701149425287356,
+ "acc_stderr,none": 0.04537158185250774
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.6764705882352942,
+ "acc_stderr,none": 0.02841820861940675
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5394190871369294,
+ "acc_stderr,none": 0.03217440335948302
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.0629940788348712
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.7319148936170212,
+ "acc_stderr,none": 0.01669476485201052
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.7162162162162162,
+ "acc_stderr,none": 0.05276603149821337
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.5912408759124088,
+ "acc_stderr,none": 0.042154748403487034
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.6190476190476191,
+ "acc_stderr,none": 0.03359110046749989
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.6056999686814908,
+ "acc_stderr,none": 0.008320757741917867,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.4868701206529453,
+ "acc_stderr,none": 0.013320449671536705
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.6513409961685823,
+ "acc_stderr,none": 0.029554116131305663
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.4588235294117647,
+ "acc_stderr,none": 0.031266224025969486
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.9259259259259259,
+ "acc_stderr,none": 0.05136112928011382
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.7603305785123967,
+ "acc_stderr,none": 0.027497867883503148
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.7368421052631579,
+ "acc_stderr,none": 0.032030558918430804
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.7041564792176039,
+ "acc_stderr,none": 0.022596206734926304
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.8214285714285714,
+ "acc_stderr,none": 0.020925145443913138
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.75,
+ "acc_stderr,none": 0.05455447255899809
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.6154271878242823,
+ "acc_stderr,none": 0.003934302947200145,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.6063947078280044,
+ "acc_stderr,none": 0.007795174544734088,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6233292831105711,
+ "acc_stderr,none": 0.011465056502784907,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.643719806763285,
+ "acc_stderr,none": 0.0095709414757183,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6098744292237442,
+ "acc_stderr,none": 0.00810834354787168,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.6056999686814908,
+ "acc_stderr,none": 0.008320757741917867,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_arabic_language_(general)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_high_biology"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_middle_history",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_middle_islamic_studies"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_primary_social_science"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_univ_management"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True,cache_dir=/tmp",
+ "model_num_parameters": 7615616512,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "bb46c15ee4bb56c5b63245ef50fd7637234d6f75",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1736532429.570835,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.9\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.3.107\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\nGPU 2: NVIDIA A100 80GB PCIe\nGPU 3: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (12 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.15.0rc2\n[pip3] open_clip_torch==2.26.1\n[pip3] optree==0.10.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.2.0a0\n[pip3] torchdata==0.7.0a0\n[pip3] 
torchdiffeq==0.2.4\n[pip3] torchmetrics==1.4.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.17.0a0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.44.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-7B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 388723.796268486,
+ "end_time": 388932.518572279,
+ "total_evaluation_time_seconds": "208.7223037930089"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-7B-Instruct/etec_v2_0_shot.json b/evaluations/ar/Qwen2.5-7B-Instruct/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..de3a7a7e958c47d31652f8cf0c753ab878aefd26
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-7B-Instruct/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.6412294647588765,
+ "acc_stderr,none": 0.011044454621265165,
+ "acc_norm,none": 0.6412294647588765,
+ "acc_norm_stderr,none": 0.011044454621265165
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7615616512,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "a09a35458c702b33eeacc393d103063234e8bc28",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738682237.6531827,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "etec_v2": "ccf52ad4d1e05dccde272349596fb8819b25302b4afaa8ddefdc7288f9965839"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-7B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 1057097.582369473,
+ "end_time": 1057186.664077031,
+ "total_evaluation_time_seconds": "89.08170755789615"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-7B-Instruct/exams_ar_5_shot.json b/evaluations/ar/Qwen2.5-7B-Instruct/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6481aee41387d362ea56fe0e4e760b125f419bf4
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-7B-Instruct/exams_ar_5_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.5065176908752328,
+ "acc_stderr,none": 0.02159487569233192,
+ "acc_norm,none": 0.5065176908752328,
+ "acc_norm_stderr,none": 0.02159487569233192
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7615616512,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "a09a35458c702b33eeacc393d103063234e8bc28",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737022505.297799,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-7B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 101438.818449475,
+ "end_time": 101848.977613468,
+ "total_evaluation_time_seconds": "410.1591639929975"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-7B-Instruct/gat_0_shot.json b/evaluations/ar/Qwen2.5-7B-Instruct/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..955b0fb4dca3f882dd405afe009e0d75919fc8cc
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-7B-Instruct/gat_0_shot.json
@@ -0,0 +1,543 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.4142104603035244,
+ "acc_stderr,none": 0.0038397567806533668,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.3888682745825603,
+ "acc_stderr,none": 0.009392255011265211
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.3493624772313297,
+ "acc_stderr,none": 0.009101555643753388
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.36474052263525947,
+ "acc_stderr,none": 0.009236399342894993
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.5023923444976076,
+ "acc_stderr,none": 0.0154744343816748
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.30901639344262294,
+ "acc_stderr,none": 0.013234964445015209
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.5462809917355372,
+ "acc_stderr,none": 0.01431819857472042
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.32745398773006135,
+ "acc_stderr,none": 0.013000616127135718
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.43561643835616437,
+ "acc_stderr,none": 0.025988942967463693
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.5512287334593573,
+ "acc_stderr,none": 0.00967270003130818
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.4142104603035244,
+ "acc_stderr,none": 0.0038397567806533668,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=False,pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True,mm=False",
+ "model_num_parameters": 7615616512,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "bb46c15ee4bb56c5b63245ef50fd7637234d6f75",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1730951149.5236645,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] 
torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.38.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-7B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 8058.842983944,
+ "end_time": 9035.124412401,
+ "total_evaluation_time_seconds": "976.2814284570013"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-7B-Instruct/moe_ien_mcq_0_shot.json b/evaluations/ar/Qwen2.5-7B-Instruct/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b61906edb1e56631c9076905d90388fcae5800d7
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-7B-Instruct/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.6637637637637638,
+ "acc_stderr,none": 0.004726808644291313,
+ "acc_norm,none": 0.6637637637637638,
+ "acc_norm_stderr,none": 0.004726808644291313
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7615616512,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "a09a35458c702b33eeacc393d103063234e8bc28",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738674600.0544074,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_mcq": "ce48b9a14bd92b18b8dc937edb46c180c4856590e207dc535b0ed1f5e8d9a7a5"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-7B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 673720.647842419,
+ "end_time": 674046.632315245,
+ "total_evaluation_time_seconds": "325.9844728260068"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-7B-Instruct/moe_ien_tf_0_shot.json b/evaluations/ar/Qwen2.5-7B-Instruct/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6e41097f92b96c06aa389a7ea1f1fb8965b290ed
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-7B-Instruct/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.7846470891293148,
+ "acc_stderr,none": 0.005387365696365709,
+ "acc_norm,none": 0.7846470891293148,
+ "acc_norm_stderr,none": 0.005387365696365709
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7615616512,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "a09a35458c702b33eeacc393d103063234e8bc28",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738682461.636686,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_tf": "f4ddc3d519c912c82ff8c20b8732077ac9136d725beb5ceddd9896a9640d070e"
+ },
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-7B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "chat_template_sha": "cd8e9439f0570856fd70470bf8889ebd8b5d1107207f67a5efb46e342330527f",
+ "start_time": 1057321.504482153,
+ "end_time": 1057680.019318038,
+ "total_evaluation_time_seconds": "358.5148358847946"
+}
\ No newline at end of file
diff --git a/evaluations/ar/Qwen2.5-7B-Instruct/openaimmlu_0_shot.json b/evaluations/ar/Qwen2.5-7B-Instruct/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6977605ca30b1faec14b58c50f2fcc5f95eed931
--- /dev/null
+++ b/evaluations/ar/Qwen2.5-7B-Instruct/openaimmlu_0_shot.json
@@ -0,0 +1,2662 @@
+{
+ "results": {
+ "openaimmlu": {
+ "acc,none": 0.5609599772112235,
+ "acc_stderr,none": 0.004081928547170564,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.5526490066225166,
+ "acc_stderr,none": 0.008946495867881253,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.44,
+ "acc_stderr,none": 0.0498887651569859
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.6776315789473685,
+ "acc_stderr,none": 0.038035102483515854
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.5694444444444444,
+ "acc_stderr,none": 0.04140685639111502
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.43,
+ "acc_stderr,none": 0.049756985195624284
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.53,
+ "acc_stderr,none": 0.05016135580465919
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.43,
+ "acc_stderr,none": 0.049756985195624284
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.38235294117647056,
+ "acc_stderr,none": 0.04835503696107223
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.62,
+ "acc_stderr,none": 0.04878317312145633
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.574468085106383,
+ "acc_stderr,none": 0.03232146916224468
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.49122807017543857,
+ "acc_stderr,none": 0.04702880432049615
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.5310344827586206,
+ "acc_stderr,none": 0.04158632762097828
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.5978835978835979,
+ "acc_stderr,none": 0.025253032554997695
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.6483870967741936,
+ "acc_stderr,none": 0.02716253782694846
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.5714285714285714,
+ "acc_stderr,none": 0.03481904844438804
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.72,
+ "acc_stderr,none": 0.04512608598542128
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.44814814814814813,
+ "acc_stderr,none": 0.03032116719631629
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.48344370860927155,
+ "acc_stderr,none": 0.040802441856289715
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.5694444444444444,
+ "acc_stderr,none": 0.03376922151252336
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.667960088691796,
+ "acc_stderr,none": 0.011032930411432253,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.7515151515151515,
+ "acc_stderr,none": 0.03374402644139405
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.7058823529411765,
+ "acc_stderr,none": 0.03198001660115071
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.7468354430379747,
+ "acc_stderr,none": 0.028304657943035286
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.71900826446281,
+ "acc_stderr,none": 0.04103203830514512
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.6851851851851852,
+ "acc_stderr,none": 0.04489931073591312
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.6319018404907976,
+ "acc_stderr,none": 0.03789213935838396
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.594855305466238,
+ "acc_stderr,none": 0.027882383791325946
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.6327160493827161,
+ "acc_stderr,none": 0.026822801759507894
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.6198830409356725,
+ "acc_stderr,none": 0.037229657413855394
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.5257923128792987,
+ "acc_stderr,none": 0.006334789144427399,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.4666666666666667,
+ "acc_stderr,none": 0.043097329010363554
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.6150943396226415,
+ "acc_stderr,none": 0.02994649856769995
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.5549132947976878,
+ "acc_stderr,none": 0.03789401760283648
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.46825396825396826,
+ "acc_stderr,none": 0.04463112720677171
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.4,
+ "acc_stderr,none": 0.049236596391733084
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.6868686868686869,
+ "acc_stderr,none": 0.03304205087813653
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.6642201834862386,
+ "acc_stderr,none": 0.02024808139675293
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.5560538116591929,
+ "acc_stderr,none": 0.03334625674242728
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.4017857142857143,
+ "acc_stderr,none": 0.04653333146973646
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.53,
+ "acc_stderr,none": 0.05016135580465919
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.6602809706257982,
+ "acc_stderr,none": 0.016936394114301652
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.6535947712418301,
+ "acc_stderr,none": 0.027245613047215362
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.425531914893617,
+ "acc_stderr,none": 0.029494827600144366
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.3983050847457627,
+ "acc_stderr,none": 0.012503310565166244
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.4742647058823529,
+ "acc_stderr,none": 0.030332578094555033
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.5343137254901961,
+ "acc_stderr,none": 0.02018014484330729
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.4457831325301205,
+ "acc_stderr,none": 0.03869543323472101
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.5733414485696896,
+ "acc_stderr,none": 0.008318351078531525,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.68,
+ "acc_stderr,none": 0.04688261722621504
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.689119170984456,
+ "acc_stderr,none": 0.03340361906276588
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.5820512820512821,
+ "acc_stderr,none": 0.025007329882461213
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.6932773109243697,
+ "acc_stderr,none": 0.029953823891887048
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.5954198473282443,
+ "acc_stderr,none": 0.043046937953806645
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.6116504854368932,
+ "acc_stderr,none": 0.0482572933735639
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.7393162393162394,
+ "acc_stderr,none": 0.028760348956523414
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.6069364161849711,
+ "acc_stderr,none": 0.026296227915613674
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.3675977653631285,
+ "acc_stderr,none": 0.016125543823552944
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.5636363636363636,
+ "acc_stderr,none": 0.04750185058907297
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.6653061224489796,
+ "acc_stderr,none": 0.030209235226242307
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.7064676616915423,
+ "acc_stderr,none": 0.03220024104534205
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.71,
+ "acc_stderr,none": 0.045604802157206845
+ }
+ },
+ "groups": {
+ "openaimmlu": {
+ "acc,none": 0.5609599772112235,
+ "acc_stderr,none": 0.004081928547170564,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.5526490066225166,
+ "acc_stderr,none": 0.008946495867881253,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.667960088691796,
+ "acc_stderr,none": 0.011032930411432253,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.5257923128792987,
+ "acc_stderr,none": 0.006334789144427399,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.5733414485696896,
+ "acc_stderr,none": 0.008318351078531525,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_prehistory",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_world_religions",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_international_law",
+ "openaimmlu_philosophy"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_management",
+ "openaimmlu_security_studies",
+ "openaimmlu_sociology",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_business_ethics",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_marketing",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_public_relations",
+ "openaimmlu_us_foreign_policy"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_professional_accounting",
+ "openaimmlu_professional_law",
+ "openaimmlu_college_medicine",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_anatomy",
+ "openaimmlu_human_aging",
+ "openaimmlu_virology",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_formal_logic",
+ "openaimmlu_machine_learning",
+ "openaimmlu_global_facts",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_nutrition"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_college_physics",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_computer_security",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_astronomy",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_college_biology",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_econometrics",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_elementary_mathematics"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu": 0,
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7615616512,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "a09a35458c702b33eeacc393d103063234e8bc28",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736969785.9646149,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.86\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|im_end|>",
+ "151645"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151645,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "Qwen/Qwen2.5-7B-Instruct",
+ "model_name_sanitized": "Qwen__Qwen2.5-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4407.244924083,
+ "end_time": 4664.374890576,
+ "total_evaluation_time_seconds": "257.1299664930002"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-13b-chat/acva_5_shot.json b/evaluations/ar/jais-adapted-13b-chat/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..011af7a3d7f7f256e03de507f6dbc764912eb1c8
--- /dev/null
+++ b/evaluations/ar/jais-adapted-13b-chat/acva_5_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.687256027554535,
+ "acc_stderr,none": 0.004967862964573529,
+ "acc_norm,none": 0.6778415614236509,
+ "acc_norm_stderr,none": 0.005007427931089761
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-13b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 13343544320,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "ee47988c252bba70001d697afb666bbb4c9fd5aa",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736966908.572879,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 2521.699275185,
+ "end_time": 4052.888725241,
+ "total_evaluation_time_seconds": "1531.1894500560002"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-13b-chat/ar_ifeval_0_shot.json b/evaluations/ar/jais-adapted-13b-chat/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..42904805c21fe6f370da10244383c044b10d6301
--- /dev/null
+++ b/evaluations/ar/jais-adapted-13b-chat/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.2332089552238806,
+ "prompt_level_strict_acc_stderr,none": 0.01828244336455248,
+ "inst_level_strict_acc,none": 0.6061433447098976,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.2667910447761194,
+ "prompt_level_loose_acc_stderr,none": 0.019121528856258296,
+ "inst_level_loose_acc,none": 0.6320819112627987,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-13b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 13343544320,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "ee47988c252bba70001d697afb666bbb4c9fd5aa",
+ "batch_size": "1",
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739704490.831331,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "ar_ifeval": "4b20e2959680620fd181f30d91c0274af9a3e1cc023b746ee5e02809d7d45954"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 22053.916395924,
+ "end_time": 27118.428955004,
+ "total_evaluation_time_seconds": "5064.512559080002"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-13b-chat/araMath_v3_5_shot.json b/evaluations/ar/jais-adapted-13b-chat/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6f4782018cbcee33a74397fc4e54cd89681dd3ec
--- /dev/null
+++ b/evaluations/ar/jais-adapted-13b-chat/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.3702479338842975,
+ "acc_stderr,none": 0.019647742288895164,
+ "acc_norm,none": 0.3702479338842975,
+ "acc_norm_stderr,none": 0.019647742288895164
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-13b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 13343544320,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "ee47988c252bba70001d697afb666bbb4c9fd5aa",
+ "batch_size": "1",
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739703677.3071382,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "araMath_v3": "b3fe722cebee19d37f6462a65a71854be30c8fada0a636e26fe49e070b49d07e"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 21240.529037809,
+ "end_time": 21359.76294948,
+ "total_evaluation_time_seconds": "119.23391167099908"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-13b-chat/araPro_0_shot.json b/evaluations/ar/jais-adapted-13b-chat/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..d541fb90a4d3b735848ba432b6c8daff10985df8
--- /dev/null
+++ b/evaluations/ar/jais-adapted-13b-chat/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.5906818636272746,
+ "acc_stderr,none": 0.006953801832222118,
+ "acc_norm,none": 0.5906818636272746,
+ "acc_norm_stderr,none": 0.006953801832222118
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-13b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 13343544320,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "ee47988c252bba70001d697afb666bbb4c9fd5aa",
+ "batch_size": "1",
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739698039.0639462,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "araPro": "ecf84d12784310b52b252574c7d56efbe3005c09fb41c792c4fa6a74fcae7239"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 15602.185312998,
+ "end_time": 17410.90263479,
+ "total_evaluation_time_seconds": "1808.717321791999"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-13b-chat/arabicmmlu_0_shot.json b/evaluations/ar/jais-adapted-13b-chat/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..059a193cbdbd22455929ed931857de06f79ba75d
--- /dev/null
+++ b/evaluations/ar/jais-adapted-13b-chat/arabicmmlu_0_shot.json
@@ -0,0 +1,2045 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.5641646489104116,
+ "acc_stderr,none": 0.004021442558151118,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5871003307607497,
+ "acc_stderr,none": 0.007950845213975143,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.4276315789473684,
+ "acc_stderr,none": 0.017957746176499655
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.6407185628742516,
+ "acc_stderr,none": 0.02629232101454999
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.6153846153846154,
+ "acc_stderr,none": 0.07892141169885801
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.543035993740219,
+ "acc_stderr,none": 0.01972172803805194
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.6699507389162561,
+ "acc_stderr,none": 0.03308530426228258
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.6680672268907563,
+ "acc_stderr,none": 0.03058869701378364
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.6078431372549019,
+ "acc_stderr,none": 0.04858083574266345
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.7237237237237237,
+ "acc_stderr,none": 0.014154447789569535
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.445859872611465,
+ "acc_stderr,none": 0.0280955038645063
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5656136087484812,
+ "acc_stderr,none": 0.011992111540822362,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.6584967320261438,
+ "acc_stderr,none": 0.019184639328092487
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.5589041095890411,
+ "acc_stderr,none": 0.026024624110486106
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.4025641025641026,
+ "acc_stderr,none": 0.02486499515976776
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.5925925925925926,
+ "acc_stderr,none": 0.09636202008710973
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.5992063492063492,
+ "acc_stderr,none": 0.030932267624392513
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6139291465378421,
+ "acc_stderr,none": 0.009743350257283902,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.5887696118909992,
+ "acc_stderr,none": 0.014145640218596737
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.6099537037037037,
+ "acc_stderr,none": 0.016603556245640024
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.6569767441860465,
+ "acc_stderr,none": 0.03630268317574833
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.7407407407407407,
+ "acc_stderr,none": 0.03453721512001164
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.6933333333333334,
+ "acc_stderr,none": 0.05360292224565066
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.5687785388127854,
+ "acc_stderr,none": 0.00826434190147144,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.4482758620689655,
+ "acc_stderr,none": 0.05362711627041053
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.55,
+ "acc_stderr,none": 0.026256714222894103
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.4903660886319846,
+ "acc_stderr,none": 0.015523869937978127
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.5805084745762712,
+ "acc_stderr,none": 0.03219081311534769
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.8045977011494253,
+ "acc_stderr,none": 0.042756781109738684
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.6360294117647058,
+ "acc_stderr,none": 0.029227192460032022
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5311203319502075,
+ "acc_stderr,none": 0.032212285760463914
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.7192982456140351,
+ "acc_stderr,none": 0.060045857397047285
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.01776672636296762
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.6216216216216216,
+ "acc_stderr,none": 0.056762926975479834
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.4744525547445255,
+ "acc_stderr,none": 0.04281864355155347
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.5571428571428572,
+ "acc_stderr,none": 0.034359114868310274
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.49357970560601316,
+ "acc_stderr,none": 0.008479533288229812,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.37260468417317244,
+ "acc_stderr,none": 0.012885268232861912
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.5325670498084292,
+ "acc_stderr,none": 0.030942837326193826
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.37254901960784315,
+ "acc_stderr,none": 0.030336449815198712
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.8888888888888888,
+ "acc_stderr,none": 0.06163335513613657
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.6487603305785123,
+ "acc_stderr,none": 0.03074931190716626
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.6526315789473685,
+ "acc_stderr,none": 0.03463365347393425
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.5158924205378973,
+ "acc_stderr,none": 0.02474118138443798
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.7678571428571429,
+ "acc_stderr,none": 0.02306723145991075
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.671875,
+ "acc_stderr,none": 0.05915529526875285
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.5641646489104116,
+ "acc_stderr,none": 0.004021442558151118,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5871003307607497,
+ "acc_stderr,none": 0.007950845213975143,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5656136087484812,
+ "acc_stderr,none": 0.011992111540822362,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6139291465378421,
+ "acc_stderr,none": 0.009743350257283902,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.5687785388127854,
+ "acc_stderr,none": 0.00826434190147144,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.49357970560601316,
+ "acc_stderr,none": 0.008479533288229812,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_arabic_language_(general)",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_high_biology",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_middle_computer_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_high_history",
+ "arabicmmlu_middle_islamic_studies"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_high_civics"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_univ_management",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_middle_general_knowledge"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-adapted-13b-chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735750896.3142433,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-adapted-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 5215.109457726,
+ "end_time": 5838.270771199,
+ "total_evaluation_time_seconds": "623.1613134729996"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-13b-chat/etec_v2_0_shot.json b/evaluations/ar/jais-adapted-13b-chat/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..dadc121b2c2fcd06157ceccc9edff500f1f185c3
--- /dev/null
+++ b/evaluations/ar/jais-adapted-13b-chat/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.48118706942236356,
+ "acc_stderr,none": 0.01150512988177613,
+ "acc_norm,none": 0.48118706942236356,
+ "acc_norm_stderr,none": 0.01150512988177613
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-13b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 13343544320,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "ee47988c252bba70001d697afb666bbb4c9fd5aa",
+ "batch_size": "1",
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739700528.9637535,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "etec_v2": "96d83c3dfc0ddb3d56ef40f620488675ad72862342308d216d4140d7d20ecd38"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 18092.127684813,
+ "end_time": 18248.631595805,
+ "total_evaluation_time_seconds": "156.50391099199987"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-13b-chat/exams_ar_5_shot.json b/evaluations/ar/jais-adapted-13b-chat/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..9d6dd8ec34324accca5bc28f7f434df30315e149
--- /dev/null
+++ b/evaluations/ar/jais-adapted-13b-chat/exams_ar_5_shot.json
@@ -0,0 +1,119 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.4823091247672253,
+ "acc_stderr,none": 0.021583188287808135,
+ "acc_norm,none": 0.4823091247672253,
+ "acc_norm_stderr,none": 0.021583188287808135
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 0.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-adapted-13b-chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735747936.9690704,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-adapted-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 2255.790595856,
+ "end_time": 2608.906088715,
+ "total_evaluation_time_seconds": "353.1154928589999"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-13b-chat/gat_0_shot.json b/evaluations/ar/jais-adapted-13b-chat/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..2d0770cd13f3caa98f419fa364d2ab3343a7d2e4
--- /dev/null
+++ b/evaluations/ar/jais-adapted-13b-chat/gat_0_shot.json
@@ -0,0 +1,539 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.33400225761946567,
+ "acc_stderr,none": 0.003661710170227351,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.2690166975881262,
+ "acc_stderr,none": 0.008543671687979955
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.29326047358834245,
+ "acc_stderr,none": 0.008690892996182613
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.28487302171512696,
+ "acc_stderr,none": 0.0086606873206029
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.39712918660287083,
+ "acc_stderr,none": 0.01514355305056311
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.16967213114754098,
+ "acc_stderr,none": 0.010750488821112222
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.4388429752066116,
+ "acc_stderr,none": 0.014271960233219975
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.36809815950920244,
+ "acc_stderr,none": 0.013360860368019332
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.26301369863013696,
+ "acc_stderr,none": 0.023076407542407414
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.4888468809073724,
+ "acc_stderr,none": 0.009721453573508959
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.33400225761946567,
+ "acc_stderr,none": 0.003661710170227351,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-adapted-13b-chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735750231.7451465,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-adapted-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4550.419244017,
+ "end_time": 5184.983570193,
+ "total_evaluation_time_seconds": "634.5643261759997"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-13b-chat/moe_ien_mcq_0_shot.json b/evaluations/ar/jais-adapted-13b-chat/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b65cf8eba3c1d01c4a5ab30a48607822848f152d
--- /dev/null
+++ b/evaluations/ar/jais-adapted-13b-chat/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.6964964964964965,
+ "acc_stderr,none": 0.004600238156515683,
+ "acc_norm,none": 0.6964964964964965,
+ "acc_norm_stderr,none": 0.004600238156515683
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-13b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 13343544320,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "ee47988c252bba70001d697afb666bbb4c9fd5aa",
+ "batch_size": "1",
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739701368.6168373,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "moe_ien_mcq": "64c1f30e4acb02ea085279bfa8affcb9f9f8f00136eb0d89b2fd705e17950843"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 18931.853201606,
+ "end_time": 19542.859156415,
+ "total_evaluation_time_seconds": "611.0059548089994"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-13b-chat/moe_ien_tf_0_shot.json b/evaluations/ar/jais-adapted-13b-chat/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6981d7264ac95c3e2ed30c1c75b6e825c3bcec45
--- /dev/null
+++ b/evaluations/ar/jais-adapted-13b-chat/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.7185299673707711,
+ "acc_stderr,none": 0.0058938953996447606,
+ "acc_norm,none": 0.7185299673707711,
+ "acc_norm_stderr,none": 0.0058938953996447606
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-13b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 13343544320,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "ee47988c252bba70001d697afb666bbb4c9fd5aa",
+ "batch_size": "1",
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739702661.550345,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "moe_ien_tf": "1b5f087aef767b97dbc9faaaacace59a2c0298137e4e95b34f3a681282d72c46"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 20224.680398667,
+ "end_time": 20560.877157062,
+ "total_evaluation_time_seconds": "336.1967583950027"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-13b-chat/openaimmlu_0_shot.json b/evaluations/ar/jais-adapted-13b-chat/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..04868ecaabc5051af530a735eb75cc8488015094
--- /dev/null
+++ b/evaluations/ar/jais-adapted-13b-chat/openaimmlu_0_shot.json
@@ -0,0 +1,2655 @@
+{
+ "results": {
+ "openaimmlu": {
+ " ": " ",
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.38311258278145693,
+ "acc_stderr,none": 0.008696620138718551,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.33,
+ "acc_stderr,none": 0.04725815626252605
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.46710526315789475,
+ "acc_stderr,none": 0.040601270352363966
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.4930555555555556,
+ "acc_stderr,none": 0.041808067502949374
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.39,
+ "acc_stderr,none": 0.04902071300001974
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.32,
+ "acc_stderr,none": 0.046882617226215034
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.32,
+ "acc_stderr,none": 0.046882617226215034
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.28431372549019607,
+ "acc_stderr,none": 0.04488482852329017
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.61,
+ "acc_stderr,none": 0.04902071300001975
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.4085106382978723,
+ "acc_stderr,none": 0.03213418026701576
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.2807017543859649,
+ "acc_stderr,none": 0.042270544512322004
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.4482758620689655,
+ "acc_stderr,none": 0.04144311810878152
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.29894179894179895,
+ "acc_stderr,none": 0.0235776047916558
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.5516129032258065,
+ "acc_stderr,none": 0.02829205683011273
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.3497536945812808,
+ "acc_stderr,none": 0.03355400904969565
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.45,
+ "acc_stderr,none": 0.049999999999999996
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.2851851851851852,
+ "acc_stderr,none": 0.027528599210340492
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.33774834437086093,
+ "acc_stderr,none": 0.038615575462551684
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.3148148148148148,
+ "acc_stderr,none": 0.03167468706828978
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.5881374722838137,
+ "acc_stderr,none": 0.011494635862007822,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.6424242424242425,
+ "acc_stderr,none": 0.03742597043806587
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.6274509803921569,
+ "acc_stderr,none": 0.03393388584958404
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.7215189873417721,
+ "acc_stderr,none": 0.029178682304842538
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.6033057851239669,
+ "acc_stderr,none": 0.04465869780531009
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.5833333333333334,
+ "acc_stderr,none": 0.04766075165356461
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.5276073619631901,
+ "acc_stderr,none": 0.039223782906109894
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.5530546623794212,
+ "acc_stderr,none": 0.02823776942208532
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.49074074074074076,
+ "acc_stderr,none": 0.027815973433878014
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.6023391812865497,
+ "acc_stderr,none": 0.0375363895576169
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.46830748482805123,
+ "acc_stderr,none": 0.006345172555588976,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.4222222222222222,
+ "acc_stderr,none": 0.042667634040995814
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.4679245283018868,
+ "acc_stderr,none": 0.030709486992556555
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.42196531791907516,
+ "acc_stderr,none": 0.0376574669386515
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.31746031746031744,
+ "acc_stderr,none": 0.04163453031302859
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.32,
+ "acc_stderr,none": 0.046882617226215034
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.6515151515151515,
+ "acc_stderr,none": 0.033948539651564025
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.5761467889908257,
+ "acc_stderr,none": 0.021187263209087526
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.5560538116591929,
+ "acc_stderr,none": 0.03334625674242728
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.25,
+ "acc_stderr,none": 0.04109974682633932
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.46,
+ "acc_stderr,none": 0.05009082659620332
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.6283524904214559,
+ "acc_stderr,none": 0.01728080252213318
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.5392156862745098,
+ "acc_stderr,none": 0.028541722692618874
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.3404255319148936,
+ "acc_stderr,none": 0.02826765748265015
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.38396349413298564,
+ "acc_stderr,none": 0.012421587833134233
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.41544117647058826,
+ "acc_stderr,none": 0.029935342707877746
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.44281045751633985,
+ "acc_stderr,none": 0.020095083154577347
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.5120481927710844,
+ "acc_stderr,none": 0.03891364495835816
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.4808277541083384,
+ "acc_stderr,none": 0.008288079309193879,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.58,
+ "acc_stderr,none": 0.049604496374885836
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.694300518134715,
+ "acc_stderr,none": 0.03324837939758159
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.441025641025641,
+ "acc_stderr,none": 0.02517404838400076
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.4327731092436975,
+ "acc_stderr,none": 0.03218358107742613
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.5877862595419847,
+ "acc_stderr,none": 0.04317171194870255
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.5728155339805825,
+ "acc_stderr,none": 0.04897957737781169
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.6752136752136753,
+ "acc_stderr,none": 0.03067902276549883
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.5404624277456648,
+ "acc_stderr,none": 0.02683080599895224
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.2547486033519553,
+ "acc_stderr,none": 0.014572650383409155
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.5727272727272728,
+ "acc_stderr,none": 0.04738198703545483
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.6204081632653061,
+ "acc_stderr,none": 0.03106721126287247
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.6119402985074627,
+ "acc_stderr,none": 0.034457899643627506
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.66,
+ "acc_stderr,none": 0.04760952285695238
+ }
+ },
+ "groups": {
+ "openaimmlu_STEM": {
+ "acc,none": 0.38311258278145693,
+ "acc_stderr,none": 0.008696620138718551,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.5881374722838137,
+ "acc_stderr,none": 0.011494635862007822,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.46830748482805123,
+ "acc_stderr,none": 0.006345172555588976,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.4808277541083384,
+ "acc_stderr,none": 0.008288079309193879,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_international_law",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_philosophy",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_world_religions",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_prehistory"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_human_sexuality",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_business_ethics",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_security_studies",
+ "openaimmlu_sociology",
+ "openaimmlu_management",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_marketing",
+ "openaimmlu_public_relations"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_formal_logic",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_virology",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_human_aging",
+ "openaimmlu_machine_learning",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_professional_law",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_college_medicine",
+ "openaimmlu_global_facts",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_anatomy",
+ "openaimmlu_nutrition"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_college_physics",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_computer_security",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_college_biology",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_astronomy",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_econometrics",
+ "openaimmlu_college_chemistry"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-13b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 13343544320,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "ee47988c252bba70001d697afb666bbb4c9fd5aa",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 8
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736968465.307927,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4078.619322506,
+ "end_time": 4483.77898923,
+ "total_evaluation_time_seconds": "405.15966672399963"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-70b-chat/acva_5_shot.json b/evaluations/ar/jais-adapted-70b-chat/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..9c95217f2c20847f6d2585d4c1b45c371be03786
--- /dev/null
+++ b/evaluations/ar/jais-adapted-70b-chat/acva_5_shot.json
@@ -0,0 +1,117 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7594718714121699,
+ "acc_stderr,none": 0.004579885680577204,
+ "acc_norm,none": 0.7332950631458094,
+ "acc_norm_stderr,none": 0.0047388260011884484
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-adapted-70b-chat,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.5,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735754509.3437214,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-adapted-70b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-70b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4236927.625948693,
+ "end_time": 4237358.337916494,
+ "total_evaluation_time_seconds": "430.7119678016752"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-70b-chat/ar_ifeval_0_shot.json b/evaluations/ar/jais-adapted-70b-chat/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6e648bee96ab49f90aa437086f0a0733537e4bd0
--- /dev/null
+++ b/evaluations/ar/jais-adapted-70b-chat/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.27052238805970147,
+ "prompt_level_strict_acc_stderr,none": 0.019205724692615982,
+ "inst_level_strict_acc,none": 0.6505119453924915,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.31343283582089554,
+ "prompt_level_loose_acc_stderr,none": 0.02005565588999481,
+ "inst_level_loose_acc,none": 0.6798634812286689,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-70b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 69500936192,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "07c93d6799cba82e240633e5fc9bb4cceea6feb2",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738760932.3293223,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "ar_ifeval": "09fb0c6580f0a42624590f94c9483581a566f54a07cf60f59a60d159e4c054e2"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-70b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-70b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 753707.325766823,
+ "end_time": 767341.93390049,
+ "total_evaluation_time_seconds": "13634.608133667032"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-70b-chat/araMath_v3_5_shot.json b/evaluations/ar/jais-adapted-70b-chat/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..1485b22662229b700bf5a297c2284e7d63238dae
--- /dev/null
+++ b/evaluations/ar/jais-adapted-70b-chat/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.456198347107438,
+ "acc_stderr,none": 0.02026649500712872,
+ "acc_norm,none": 0.456198347107438,
+ "acc_norm_stderr,none": 0.02026649500712872
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-70b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 69500936192,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "07c93d6799cba82e240633e5fc9bb4cceea6feb2",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738755169.9928548,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "araMath_v3": "d0d66a51e36e6cb52cf906fef452bc518aad1a1e641c82f522dc8014f42cc48e"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-70b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-70b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 747945.013267984,
+ "end_time": 748222.617730487,
+ "total_evaluation_time_seconds": "277.6044625029899"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-70b-chat/araPro_0_shot.json b/evaluations/ar/jais-adapted-70b-chat/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ad6206be695bfc4202366a800d416d69bb295d50
--- /dev/null
+++ b/evaluations/ar/jais-adapted-70b-chat/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.645870825834833,
+ "acc_stderr,none": 0.0067634562491415175,
+ "acc_norm,none": 0.645870825834833,
+ "acc_norm_stderr,none": 0.0067634562491415175
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-70b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 69500936192,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "07c93d6799cba82e240633e5fc9bb4cceea6feb2",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738742634.7898378,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "araPro": "6801d81fb64458427c0b7638660f113d7777c17252b7552d3a623eccf14d861c"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-70b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-70b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 735409.963649845,
+ "end_time": 743076.317063995,
+ "total_evaluation_time_seconds": "7666.353414150071"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-70b-chat/arabicmmlu_0_shot.json b/evaluations/ar/jais-adapted-70b-chat/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f7ad20a5f9db2daaf9be766c55f4623feb026022
--- /dev/null
+++ b/evaluations/ar/jais-adapted-70b-chat/arabicmmlu_0_shot.json
@@ -0,0 +1,2045 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.6573503977862332,
+ "acc_stderr,none": 0.003840281351500485,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.7036934950385888,
+ "acc_stderr,none": 0.007378737509782706,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.5223684210526316,
+ "acc_stderr,none": 0.018130679701241173
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.7095808383233533,
+ "acc_stderr,none": 0.02487662483308632
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.717948717948718,
+ "acc_stderr,none": 0.07299934324587597
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.6932707355242567,
+ "acc_stderr,none": 0.01825654959511757
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.6995073891625616,
+ "acc_stderr,none": 0.03225799476233485
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.7310924369747899,
+ "acc_stderr,none": 0.02880139219363128
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.04690650298201943
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.8278278278278278,
+ "acc_stderr,none": 0.011950503938766361
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.7547770700636943,
+ "acc_stderr,none": 0.024317432483448788
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6688942891859052,
+ "acc_stderr,none": 0.011240306622831422,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.7761437908496732,
+ "acc_stderr,none": 0.016863008585416617
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.684931506849315,
+ "acc_stderr,none": 0.02434867698272133
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.4666666666666667,
+ "acc_stderr,none": 0.02529460802398647
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.7037037037037037,
+ "acc_stderr,none": 0.0895511888632576
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.6944444444444444,
+ "acc_stderr,none": 0.029075486178441058
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.714975845410628,
+ "acc_stderr,none": 0.009053330450889227,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.7142857142857143,
+ "acc_stderr,none": 0.012987012987013052
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.6921296296296297,
+ "acc_stderr,none": 0.015713476123598046
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.7674418604651163,
+ "acc_stderr,none": 0.0323065408320345
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.7654320987654321,
+ "acc_stderr,none": 0.03339448023577033
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.76,
+ "acc_stderr,none": 0.04964740541926503
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6269977168949772,
+ "acc_stderr,none": 0.008066232886874773,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.45977011494252873,
+ "acc_stderr,none": 0.053741581963657706
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.6444444444444445,
+ "acc_stderr,none": 0.025263833600917815
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.5452793834296724,
+ "acc_stderr,none": 0.015462954686403765
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.6016949152542372,
+ "acc_stderr,none": 0.0319346503074861
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.7816091954022989,
+ "acc_stderr,none": 0.044551545932103705
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.6727941176470589,
+ "acc_stderr,none": 0.028501452860396563
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5767634854771784,
+ "acc_stderr,none": 0.031892225234464444
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.6842105263157895,
+ "acc_stderr,none": 0.06211545730021919
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.7475177304964539,
+ "acc_stderr,none": 0.016373437342591536
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.6081081081081081,
+ "acc_stderr,none": 0.05713629906375233
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.6131386861313869,
+ "acc_stderr,none": 0.04176260268579586
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.6285714285714286,
+ "acc_stderr,none": 0.033422722963748645
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5872220482305043,
+ "acc_stderr,none": 0.008392168384789572,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.48190205819730303,
+ "acc_stderr,none": 0.013316313061005655
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.6436781609195402,
+ "acc_stderr,none": 0.029700853786923786
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.4627450980392157,
+ "acc_stderr,none": 0.031285582720181296
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.9259259259259259,
+ "acc_stderr,none": 0.05136112928011382
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.768595041322314,
+ "acc_stderr,none": 0.027166056421232626
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.7526315789473684,
+ "acc_stderr,none": 0.03138574519882399
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.5696821515892421,
+ "acc_stderr,none": 0.024512121738684653
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.8363095238095238,
+ "acc_stderr,none": 0.020214957089599812
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.65625,
+ "acc_stderr,none": 0.05983919423477113
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.6573503977862332,
+ "acc_stderr,none": 0.003840281351500485,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.7036934950385888,
+ "acc_stderr,none": 0.007378737509782706,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6688942891859052,
+ "acc_stderr,none": 0.011240306622831422,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.714975845410628,
+ "acc_stderr,none": 0.009053330450889227,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6269977168949772,
+ "acc_stderr,none": 0.008066232886874773,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5872220482305043,
+ "acc_stderr,none": 0.008392168384789572,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_arabic_language_(general)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_middle_natural_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_primary_islamic_studies"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_high_geography"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_univ_management",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_general_knowledge"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-adapted-70b-chat,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.8,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735742245.74136,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-adapted-70b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-70b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 157154.208849809,
+ "end_time": 157971.604345979,
+ "total_evaluation_time_seconds": "817.3954961700074"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-70b-chat/etec_v2_0_shot.json b/evaluations/ar/jais-adapted-70b-chat/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f70a7110ec59d13eea77fcb2737dc7d123b96526
--- /dev/null
+++ b/evaluations/ar/jais-adapted-70b-chat/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.5680975092739798,
+ "acc_stderr,none": 0.011406002243769555,
+ "acc_norm,none": 0.5680975092739798,
+ "acc_norm_stderr,none": 0.011406002243769555
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-70b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 69500936192,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "07c93d6799cba82e240633e5fc9bb4cceea6feb2",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738750590.832167,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "etec_v2": "d74045de4716b9652a4bfefbbb9f15b8700f98c226ac24538bb01ca5e0c7c2b2"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-70b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-70b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 743365.908676943,
+ "end_time": 743722.955220173,
+ "total_evaluation_time_seconds": "357.0465432299534"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-70b-chat/exams_ar_5_shot.json b/evaluations/ar/jais-adapted-70b-chat/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f5c51284805c4f266372001aad2dfbb03271a75d
--- /dev/null
+++ b/evaluations/ar/jais-adapted-70b-chat/exams_ar_5_shot.json
@@ -0,0 +1,119 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.547486033519553,
+ "acc_stderr,none": 0.021499092163260354,
+ "acc_norm,none": 0.547486033519553,
+ "acc_norm_stderr,none": 0.021499092163260354
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 0.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-adapted-70b-chat,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.5,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735753326.6754909,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-adapted-70b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-70b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4235744.834529697,
+ "end_time": 4236890.418296373,
+ "total_evaluation_time_seconds": "1145.5837666764855"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-70b-chat/gat_0_shot.json b/evaluations/ar/jais-adapted-70b-chat/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..5d779bdb9d5caf44cc87872fcaee8f308a0d7987
--- /dev/null
+++ b/evaluations/ar/jais-adapted-70b-chat/gat_0_shot.json
@@ -0,0 +1,539 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.39150884234290734,
+ "acc_stderr,none": 0.0037870650562161724,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.30871985157699444,
+ "acc_stderr,none": 0.008900420500465429
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.3333333333333333,
+ "acc_stderr,none": 0.008999154119267206
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.31947000368053,
+ "acc_stderr,none": 0.008946925003650451
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.48038277511961724,
+ "acc_stderr,none": 0.015462696567602829
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.3401639344262295,
+ "acc_stderr,none": 0.013569389383985758
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.5776859504132231,
+ "acc_stderr,none": 0.014205303507223562
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.37806748466257667,
+ "acc_stderr,none": 0.013433342491211057
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.28493150684931506,
+ "acc_stderr,none": 0.023658835631635913
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.5349716446124764,
+ "acc_stderr,none": 0.009700058955969343
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.39150884234290734,
+ "acc_stderr,none": 0.0037870650562161724,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-adapted-70b-chat,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.8,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735741378.0475895,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-adapted-70b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-70b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 156286.643002293,
+ "end_time": 157115.263612495,
+ "total_evaluation_time_seconds": "828.6206102019933"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-70b-chat/moe_ien_mcq_0_shot.json b/evaluations/ar/jais-adapted-70b-chat/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a27e5af3ec26e374ba12a9a9fff3f52ddd825c71
--- /dev/null
+++ b/evaluations/ar/jais-adapted-70b-chat/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.7451451451451452,
+ "acc_stderr,none": 0.004360194744412726,
+ "acc_norm,none": 0.7451451451451452,
+ "acc_norm_stderr,none": 0.004360194744412726
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-70b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 69500936192,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "07c93d6799cba82e240633e5fc9bb4cceea6feb2",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738751017.0602386,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "moe_ien_mcq": "10880f503e175cc1732ea242e62a05f551ab3037c2343137caef8ccae9b636d6"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-70b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-70b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 743792.167701501,
+ "end_time": 745208.032451816,
+ "total_evaluation_time_seconds": "1415.8647503149696"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-70b-chat/moe_ien_tf_0_shot.json b/evaluations/ar/jais-adapted-70b-chat/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..537c18bb03d1b584ffbd58ac72ee007e3aaa1951
--- /dev/null
+++ b/evaluations/ar/jais-adapted-70b-chat/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.7647260862098575,
+ "acc_stderr,none": 0.005559090451740826,
+ "acc_norm,none": 0.7647260862098575,
+ "acc_norm_stderr,none": 0.005559090451740826
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-70b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 69500936192,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "07c93d6799cba82e240633e5fc9bb4cceea6feb2",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738752498.2153778,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "moe_ien_tf": "944b34dde7f12f68b21e22312c06a9cdc68419df98db10d8e947f07ff8680ed0"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-70b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-70b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 745273.350897887,
+ "end_time": 746075.048487207,
+ "total_evaluation_time_seconds": "801.6975893200142"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-70b-chat/openaimmlu_0_shot.json b/evaluations/ar/jais-adapted-70b-chat/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..02d0da7a70a4ec90b109cf165320f03c78778bbe
--- /dev/null
+++ b/evaluations/ar/jais-adapted-70b-chat/openaimmlu_0_shot.json
@@ -0,0 +1,2649 @@
+{
+ "results": {
+ "openaimmlu": {
+ " ": " ",
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.47980132450331126,
+ "acc_stderr,none": 0.008824818939843108,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.33,
+ "acc_stderr,none": 0.04725815626252604
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.6118421052631579,
+ "acc_stderr,none": 0.03965842097512744
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.5833333333333334,
+ "acc_stderr,none": 0.04122728707651282
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.39,
+ "acc_stderr,none": 0.04902071300001975
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.43,
+ "acc_stderr,none": 0.04975698519562428
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.29,
+ "acc_stderr,none": 0.04560480215720683
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.3137254901960784,
+ "acc_stderr,none": 0.046170348270067184
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.68,
+ "acc_stderr,none": 0.046882617226215034
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.5404255319148936,
+ "acc_stderr,none": 0.03257901482099835
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.34210526315789475,
+ "acc_stderr,none": 0.04462917535336936
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.496551724137931,
+ "acc_stderr,none": 0.041665675771015785
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.42592592592592593,
+ "acc_stderr,none": 0.02546714904546955
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.7096774193548387,
+ "acc_stderr,none": 0.025822106119415898
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.4975369458128079,
+ "acc_stderr,none": 0.03517945038691063
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.6,
+ "acc_stderr,none": 0.049236596391733084
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.337037037037037,
+ "acc_stderr,none": 0.028820884666253255
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.3576158940397351,
+ "acc_stderr,none": 0.03913453431177258
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.47685185185185186,
+ "acc_stderr,none": 0.03406315360711507
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.7045454545454546,
+ "acc_stderr,none": 0.010623479338923845,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.7575757575757576,
+ "acc_stderr,none": 0.03346409881055953
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.7843137254901961,
+ "acc_stderr,none": 0.028867431449849303
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.810126582278481,
+ "acc_stderr,none": 0.025530100460233504
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.7355371900826446,
+ "acc_stderr,none": 0.040261875275912046
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.6574074074074074,
+ "acc_stderr,none": 0.04587904741301812
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.6748466257668712,
+ "acc_stderr,none": 0.036803503712864616
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.684887459807074,
+ "acc_stderr,none": 0.026385273703464496
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.5802469135802469,
+ "acc_stderr,none": 0.027460099557005138
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.7192982456140351,
+ "acc_stderr,none": 0.034462962170884265
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.5701281186783547,
+ "acc_stderr,none": 0.006240310572749657,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.5259259259259259,
+ "acc_stderr,none": 0.04313531696750575
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.5773584905660377,
+ "acc_stderr,none": 0.03040233144576954
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.5086705202312138,
+ "acc_stderr,none": 0.03811890988940412
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.3412698412698413,
+ "acc_stderr,none": 0.04240799327574925
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.45,
+ "acc_stderr,none": 0.04999999999999999
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.7474747474747475,
+ "acc_stderr,none": 0.030954055470365907
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.7522935779816514,
+ "acc_stderr,none": 0.018508143602547815
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.6591928251121076,
+ "acc_stderr,none": 0.0318114974705536
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.4642857142857143,
+ "acc_stderr,none": 0.04733667890053757
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.65,
+ "acc_stderr,none": 0.04793724854411019
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.719029374201788,
+ "acc_stderr,none": 0.016073127851221235
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.6830065359477124,
+ "acc_stderr,none": 0.026643278474508755
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.425531914893617,
+ "acc_stderr,none": 0.029494827600144366
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.4517601043024772,
+ "acc_stderr,none": 0.012710662233660247
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.4963235294117647,
+ "acc_stderr,none": 0.0303720158854282
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.576797385620915,
+ "acc_stderr,none": 0.019987809769482064
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.5240963855421686,
+ "acc_stderr,none": 0.03887971849597264
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.5709068776628119,
+ "acc_stderr,none": 0.007959901709763195,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.7,
+ "acc_stderr,none": 0.046056618647183814
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.8134715025906736,
+ "acc_stderr,none": 0.02811209121011747
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.5871794871794872,
+ "acc_stderr,none": 0.024962683564331796
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.5966386554621849,
+ "acc_stderr,none": 0.031866081214088314
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.7251908396946565,
+ "acc_stderr,none": 0.03915345408847835
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.6990291262135923,
+ "acc_stderr,none": 0.045416094465039476
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.782051282051282,
+ "acc_stderr,none": 0.02704685763071667
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.661849710982659,
+ "acc_stderr,none": 0.02546977014940017
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.27262569832402234,
+ "acc_stderr,none": 0.01489339173524962
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.5818181818181818,
+ "acc_stderr,none": 0.04724577405731571
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.6816326530612244,
+ "acc_stderr,none": 0.029822533793982055
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.7164179104477612,
+ "acc_stderr,none": 0.03187187537919798
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.8,
+ "acc_stderr,none": 0.04020151261036844
+ }
+ },
+ "groups": {
+ "openaimmlu_STEM": {
+ "acc,none": 0.47980132450331126,
+ "acc_stderr,none": 0.008824818939843108,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.7045454545454546,
+ "acc_stderr,none": 0.010623479338923845,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.5701281186783547,
+ "acc_stderr,none": 0.006240310572749657,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.5709068776628119,
+ "acc_stderr,none": 0.007959901709763195,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_jurisprudence",
+ "openaimmlu_international_law",
+ "openaimmlu_world_religions",
+ "openaimmlu_prehistory",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_philosophy",
+ "openaimmlu_high_school_european_history"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_public_relations",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_security_studies",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_sociology",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_management",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_marketing",
+ "openaimmlu_business_ethics"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_anatomy",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_professional_law",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_global_facts",
+ "openaimmlu_virology",
+ "openaimmlu_nutrition",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_college_medicine",
+ "openaimmlu_machine_learning",
+ "openaimmlu_human_aging",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_formal_logic",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_professional_medicine"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_college_physics",
+ "openaimmlu_college_biology",
+ "openaimmlu_computer_security",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_econometrics",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_astronomy",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_conceptual_physics"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-adapted-70b-chat,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.5,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735756107.204563,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-adapted-70b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-70b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4238525.433585406,
+ "end_time": 4239500.613676238,
+ "total_evaluation_time_seconds": "975.1800908315927"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-7b-chat/acva_5_shot.json b/evaluations/ar/jais-adapted-7b-chat/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ec7af286a6552e62ab5b24ef29568e4a25716380
--- /dev/null
+++ b/evaluations/ar/jais-adapted-7b-chat/acva_5_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7163030998851895,
+ "acc_stderr,none": 0.004830494202743803,
+ "acc_norm,none": 0.7043628013777268,
+ "acc_norm_stderr,none": 0.004889828190051208
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "f2de64b06baedc5546928fbdea10fca517f7cbc7",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736967182.7463732,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-7b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 2371.207720225,
+ "end_time": 3202.344691831,
+ "total_evaluation_time_seconds": "831.1369716060003"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-7b-chat/ar_ifeval_0_shot.json b/evaluations/ar/jais-adapted-7b-chat/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..00d077ab91f5c40933db4a017c70c789fd9b8cf7
--- /dev/null
+++ b/evaluations/ar/jais-adapted-7b-chat/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.14925373134328357,
+ "prompt_level_strict_acc_stderr,none": 0.015405852451693323,
+ "inst_level_strict_acc,none": 0.5426621160409556,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.17723880597014927,
+ "prompt_level_loose_acc_stderr,none": 0.016509708932173617,
+ "inst_level_loose_acc,none": 0.578839590443686,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "f2de64b06baedc5546928fbdea10fca517f7cbc7",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739619028.4068084,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "ar_ifeval": "4b20e2959680620fd181f30d91c0274af9a3e1cc023b746ee5e02809d7d45954"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-7b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 1982642.64143783,
+ "end_time": 1986984.51241685,
+ "total_evaluation_time_seconds": "4341.870979020139"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-7b-chat/araMath_v3_5_shot.json b/evaluations/ar/jais-adapted-7b-chat/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..173b7ffbf1d6b4996deb8512c8e20c63cf617ddc
--- /dev/null
+++ b/evaluations/ar/jais-adapted-7b-chat/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.28429752066115704,
+ "acc_stderr,none": 0.01835415215519967,
+ "acc_norm,none": 0.28429752066115704,
+ "acc_norm_stderr,none": 0.01835415215519967
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "f2de64b06baedc5546928fbdea10fca517f7cbc7",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618892.533642,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "araMath_v3": "b3fe722cebee19d37f6462a65a71854be30c8fada0a636e26fe49e070b49d07e"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-7b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 1982507.115611266,
+ "end_time": 1982583.278987088,
+ "total_evaluation_time_seconds": "76.1633758218959"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-7b-chat/araPro_0_shot.json b/evaluations/ar/jais-adapted-7b-chat/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..3cd75bfc20b65d2660e3ebaeb7552525942c64ec
--- /dev/null
+++ b/evaluations/ar/jais-adapted-7b-chat/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.5058988202359528,
+ "acc_stderr,none": 0.007070575703856374,
+ "acc_norm,none": 0.5058988202359528,
+ "acc_norm_stderr,none": 0.007070575703856374
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "f2de64b06baedc5546928fbdea10fca517f7cbc7",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739617069.9442637,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "araPro": "ecf84d12784310b52b252574c7d56efbe3005c09fb41c792c4fa6a74fcae7239"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-7b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 1980684.567416227,
+ "end_time": 1981571.878844224,
+ "total_evaluation_time_seconds": "887.3114279969595"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-7b-chat/arabicmmlu_0_shot.json b/evaluations/ar/jais-adapted-7b-chat/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..464053549893c109fc11d87dd2cead3929673407
--- /dev/null
+++ b/evaluations/ar/jais-adapted-7b-chat/arabicmmlu_0_shot.json
@@ -0,0 +1,2045 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.4975441023867174,
+ "acc_stderr,none": 0.004073384874245624,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5173649393605292,
+ "acc_stderr,none": 0.008059301844728773,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.3671052631578947,
+ "acc_stderr,none": 0.01749605598016935
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.5329341317365269,
+ "acc_stderr,none": 0.027340327767287394
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.5384615384615384,
+ "acc_stderr,none": 0.0808703820058226
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.3974960876369327,
+ "acc_stderr,none": 0.019374746350863278
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.5812807881773399,
+ "acc_stderr,none": 0.03471192860518469
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.6008403361344538,
+ "acc_stderr,none": 0.031811100324139245
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.49019607843137253,
+ "acc_stderr,none": 0.04974229460422817
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.6726726726726727,
+ "acc_stderr,none": 0.014853464205696236
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.5159235668789809,
+ "acc_stderr,none": 0.028247335253768956
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5018226002430134,
+ "acc_stderr,none": 0.012147423836099071,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.5833333333333334,
+ "acc_stderr,none": 0.01994491413687358
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.5178082191780822,
+ "acc_stderr,none": 0.02619049337476246
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.35384615384615387,
+ "acc_stderr,none": 0.024243783994062167
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.5925925925925926,
+ "acc_stderr,none": 0.09636202008710973
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.5,
+ "acc_stderr,none": 0.031559720154890156
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.5233494363929146,
+ "acc_stderr,none": 0.009987155759790199,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.5408753096614368,
+ "acc_stderr,none": 0.014325876981508813
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.4664351851851852,
+ "acc_stderr,none": 0.016981804836010583
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.5581395348837209,
+ "acc_stderr,none": 0.03797658515942914
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.6234567901234568,
+ "acc_stderr,none": 0.038185427041450865
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.6,
+ "acc_stderr,none": 0.05694947974514993
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.4877283105022831,
+ "acc_stderr,none": 0.00829476633798559,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.367816091954023,
+ "acc_stderr,none": 0.05199814559011102
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.49166666666666664,
+ "acc_stderr,none": 0.026385325306307095
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.3978805394990366,
+ "acc_stderr,none": 0.015199465039911994
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.4152542372881356,
+ "acc_stderr,none": 0.03214449793774544
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.735632183908046,
+ "acc_stderr,none": 0.04755382188278442
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.47794117647058826,
+ "acc_stderr,none": 0.030343264224213514
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.43568464730290457,
+ "acc_stderr,none": 0.032006739876642154
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.5263157894736842,
+ "acc_stderr,none": 0.06672270432067239
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.6411347517730497,
+ "acc_stderr,none": 0.018078151909972997
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.4864864864864865,
+ "acc_stderr,none": 0.05849919621886871
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.49635036496350365,
+ "acc_stderr,none": 0.04287350410390777
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.49523809523809526,
+ "acc_stderr,none": 0.034584154644211426
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.46351393673661134,
+ "acc_stderr,none": 0.00858845350484014,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.3860894251242016,
+ "acc_stderr,none": 0.012974636011804944
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.4827586206896552,
+ "acc_stderr,none": 0.030990242561135053
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.30196078431372547,
+ "acc_stderr,none": 0.02880701939354399
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.09245003270420485
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.5826446280991735,
+ "acc_stderr,none": 0.031764816874392546
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.6631578947368421,
+ "acc_stderr,none": 0.03437880340748323
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.44987775061124696,
+ "acc_stderr,none": 0.024629000128784228
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.6845238095238095,
+ "acc_stderr,none": 0.02538955971347752
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.53125,
+ "acc_stderr,none": 0.06287092313773097
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.4975441023867174,
+ "acc_stderr,none": 0.004073384874245624,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5173649393605292,
+ "acc_stderr,none": 0.008059301844728773,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5018226002430134,
+ "acc_stderr,none": 0.012147423836099071,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.5233494363929146,
+ "acc_stderr,none": 0.009987155759790199,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.4877283105022831,
+ "acc_stderr,none": 0.00829476633798559,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.46351393673661134,
+ "acc_stderr,none": 0.00858845350484014,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(general)",
+ "arabicmmlu_arabic_language_(grammar)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_middle_natural_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_high_history",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_prof_law"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_high_economics"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_univ_management",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_middle_general_knowledge"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-adapted-7b-chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735749990.730385,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-adapted-7b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4050.237020402,
+ "end_time": 4482.328043771,
+ "total_evaluation_time_seconds": "432.09102336900014"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-7b-chat/etec_v2_0_shot.json b/evaluations/ar/jais-adapted-7b-chat/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..9df257384e39727ac4dbb0fbaa6cf21a8ba422ad
--- /dev/null
+++ b/evaluations/ar/jais-adapted-7b-chat/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.40487546369899313,
+ "acc_stderr,none": 0.011303002223987405,
+ "acc_norm,none": 0.40487546369899313,
+ "acc_norm_stderr,none": 0.011303002223987405
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "f2de64b06baedc5546928fbdea10fca517f7cbc7",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618018.0630515,
+    "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "etec_v2": "96d83c3dfc0ddb3d56ef40f620488675ad72862342308d216d4140d7d20ecd38"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-7b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 1981632.798806175,
+ "end_time": 1981734.653376021,
+ "total_evaluation_time_seconds": "101.85456984606571"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-7b-chat/exams_ar_5_shot.json b/evaluations/ar/jais-adapted-7b-chat/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..64ed8fd051b01ae21152b97ff71a81f2b783b771
--- /dev/null
+++ b/evaluations/ar/jais-adapted-7b-chat/exams_ar_5_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.4059590316573557,
+ "acc_stderr,none": 0.021211281507636986,
+ "acc_norm,none": 0.4059590316573557,
+ "acc_norm_stderr,none": 0.021211281507636986
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "f2de64b06baedc5546928fbdea10fca517f7cbc7",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737021909.6242902,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-7b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1533.092145855,
+ "end_time": 2256.740809025,
+ "total_evaluation_time_seconds": "723.6486631700002"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-7b-chat/gat_0_shot.json b/evaluations/ar/jais-adapted-7b-chat/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f7768fe4158f110e9f6398355658b9b400a2394c
--- /dev/null
+++ b/evaluations/ar/jais-adapted-7b-chat/gat_0_shot.json
@@ -0,0 +1,539 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.2967515364354697,
+ "acc_stderr,none": 0.003604585447272368,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.24749536178107606,
+ "acc_stderr,none": 0.008314561061258798
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.2790528233151184,
+ "acc_stderr,none": 0.008562545250353257
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.2800883327199117,
+ "acc_stderr,none": 0.00861632818616305
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.3196172248803828,
+ "acc_stderr,none": 0.01443249760130354
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.3590163934426229,
+ "acc_stderr,none": 0.01373974739490732
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.30082644628099175,
+ "acc_stderr,none": 0.013189773951403421
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.27070552147239263,
+ "acc_stderr,none": 0.012309142853473802
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.25753424657534246,
+ "acc_stderr,none": 0.02291949350361232
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.3610586011342155,
+ "acc_stderr,none": 0.009340898141734538
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.2967515364354697,
+ "acc_stderr,none": 0.003604585447272368,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-adapted-7b-chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735749532.8652654,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-adapted-7b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 3592.214233832,
+ "end_time": 4020.148395127,
+ "total_evaluation_time_seconds": "427.9341612950002"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-7b-chat/moe_ien_mcq_0_shot.json b/evaluations/ar/jais-adapted-7b-chat/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..27a91606e395d161f706b63c76ef67d58e68dd52
--- /dev/null
+++ b/evaluations/ar/jais-adapted-7b-chat/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.5737737737737738,
+ "acc_stderr,none": 0.004947996965610276,
+ "acc_norm,none": 0.5737737737737738,
+ "acc_norm_stderr,none": 0.004947996965610276
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "f2de64b06baedc5546928fbdea10fca517f7cbc7",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618162.2068646,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "moe_ien_mcq": "64c1f30e4acb02ea085279bfa8affcb9f9f8f00136eb0d89b2fd705e17950843"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-7b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 1981776.82187215,
+ "end_time": 1982168.095300103,
+ "total_evaluation_time_seconds": "391.2734279530123"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-7b-chat/moe_ien_tf_0_shot.json b/evaluations/ar/jais-adapted-7b-chat/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..393e3a109138baca0c456090226f60b80b57617e
--- /dev/null
+++ b/evaluations/ar/jais-adapted-7b-chat/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.6718186501803194,
+ "acc_stderr,none": 0.006153849572169566,
+ "acc_norm,none": 0.6718186501803194,
+ "acc_norm_stderr,none": 0.006153849572169566
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "f2de64b06baedc5546928fbdea10fca517f7cbc7",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739618613.2639303,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "moe_ien_tf": "1b5f087aef767b97dbc9faaaacace59a2c0298137e4e95b34f3a681282d72c46"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-7b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 1982227.943355788,
+ "end_time": 1982447.325638794,
+ "total_evaluation_time_seconds": "219.38228300609626"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-adapted-7b-chat/openaimmlu_0_shot.json b/evaluations/ar/jais-adapted-7b-chat/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ff0c7c6c79009420661694d607473ded28e8782d
--- /dev/null
+++ b/evaluations/ar/jais-adapted-7b-chat/openaimmlu_0_shot.json
@@ -0,0 +1,2662 @@
+{
+ "results": {
+ "openaimmlu": {
+ "acc,none": 0.3854151830223615,
+ "acc_stderr,none": 0.004031384548470796,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.3258278145695364,
+ "acc_stderr,none": 0.008457779824528174,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.26,
+ "acc_stderr,none": 0.04408440022768077
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.4276315789473684,
+ "acc_stderr,none": 0.04026097083296558
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.4097222222222222,
+ "acc_stderr,none": 0.04112490974670787
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.22,
+ "acc_stderr,none": 0.04163331998932269
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.36,
+ "acc_stderr,none": 0.04824181513244218
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.25,
+ "acc_stderr,none": 0.04351941398892446
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.22549019607843138,
+ "acc_stderr,none": 0.041583075330832865
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.49,
+ "acc_stderr,none": 0.05024183937956912
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.3191489361702128,
+ "acc_stderr,none": 0.030472973363380045
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.2894736842105263,
+ "acc_stderr,none": 0.04266339443159394
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.4,
+ "acc_stderr,none": 0.040824829046386284
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.2804232804232804,
+ "acc_stderr,none": 0.023135287974325628
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.3741935483870968,
+ "acc_stderr,none": 0.027528904299845777
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.3694581280788177,
+ "acc_stderr,none": 0.03395970381998575
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.43,
+ "acc_stderr,none": 0.049756985195624284
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.3,
+ "acc_stderr,none": 0.027940457136228402
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.271523178807947,
+ "acc_stderr,none": 0.03631329803969654
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.2361111111111111,
+ "acc_stderr,none": 0.02896370257079102
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.4861419068736142,
+ "acc_stderr,none": 0.011703480584172478,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.5151515151515151,
+ "acc_stderr,none": 0.039025510073744475
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.45588235294117646,
+ "acc_stderr,none": 0.034956245220154746
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.5991561181434599,
+ "acc_stderr,none": 0.031900803894732356
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.5867768595041323,
+ "acc_stderr,none": 0.04495087843548408
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.48148148148148145,
+ "acc_stderr,none": 0.04830366024635331
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.43558282208588955,
+ "acc_stderr,none": 0.03895632464138937
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.43729903536977494,
+ "acc_stderr,none": 0.028173917761762878
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.42592592592592593,
+ "acc_stderr,none": 0.027513747284379424
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.52046783625731,
+ "acc_stderr,none": 0.038316105328219316
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.3792987188132165,
+ "acc_stderr,none": 0.006232325281499182,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.4222222222222222,
+ "acc_stderr,none": 0.04266763404099582
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.3622641509433962,
+ "acc_stderr,none": 0.0295822451283843
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.3179190751445087,
+ "acc_stderr,none": 0.0355068398916558
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.29365079365079366,
+ "acc_stderr,none": 0.040735243221471255
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.37,
+ "acc_stderr,none": 0.04852365870939098
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.4797979797979798,
+ "acc_stderr,none": 0.03559443565563919
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.42018348623853213,
+ "acc_stderr,none": 0.021162420048273515
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.47085201793721976,
+ "acc_stderr,none": 0.03350073248773404
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.375,
+ "acc_stderr,none": 0.04595091388086298
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.38,
+ "acc_stderr,none": 0.048783173121456316
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.5057471264367817,
+ "acc_stderr,none": 0.017878782326129227
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.4542483660130719,
+ "acc_stderr,none": 0.02850980780262657
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.3333333333333333,
+ "acc_stderr,none": 0.028121636040639882
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.3239895697522816,
+ "acc_stderr,none": 0.011952840809646566
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.22058823529411764,
+ "acc_stderr,none": 0.025187786660227265
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.34967320261437906,
+ "acc_stderr,none": 0.01929196189506638
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.35542168674698793,
+ "acc_stderr,none": 0.03726214354322415
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.3959220937309799,
+ "acc_stderr,none": 0.00827574379380361,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.39,
+ "acc_stderr,none": 0.04902071300001974
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.39378238341968913,
+ "acc_stderr,none": 0.03526077095548237
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.34615384615384615,
+ "acc_stderr,none": 0.024121125416941183
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.3445378151260504,
+ "acc_stderr,none": 0.030868682604121633
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.4732824427480916,
+ "acc_stderr,none": 0.04379024936553894
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.44660194174757284,
+ "acc_stderr,none": 0.04922424153458933
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.5982905982905983,
+ "acc_stderr,none": 0.03211693751051621
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.4797687861271676,
+ "acc_stderr,none": 0.026897049996382875
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.24022346368715083,
+ "acc_stderr,none": 0.014288343803925307
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.4,
+ "acc_stderr,none": 0.0469237132203465
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.5061224489795918,
+ "acc_stderr,none": 0.032006820201639086
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.5373134328358209,
+ "acc_stderr,none": 0.03525675167467974
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.64,
+ "acc_stderr,none": 0.04824181513244218
+ }
+ },
+ "groups": {
+ "openaimmlu": {
+ "acc,none": 0.3854151830223615,
+ "acc_stderr,none": 0.004031384548470796,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.3258278145695364,
+ "acc_stderr,none": 0.008457779824528174,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.4861419068736142,
+ "acc_stderr,none": 0.011703480584172478,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.3792987188132165,
+ "acc_stderr,none": 0.006232325281499182,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.3959220937309799,
+ "acc_stderr,none": 0.00827574379380361,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_prehistory",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_philosophy",
+ "openaimmlu_international_law",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_world_religions",
+ "openaimmlu_high_school_european_history"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_marketing",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_business_ethics",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_security_studies",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_public_relations",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_management",
+ "openaimmlu_sociology",
+ "openaimmlu_human_sexuality"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_professional_law",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_nutrition",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_formal_logic",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_college_medicine",
+ "openaimmlu_human_aging",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_anatomy",
+ "openaimmlu_global_facts",
+ "openaimmlu_machine_learning",
+ "openaimmlu_virology"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_high_school_physics",
+ "openaimmlu_college_biology",
+ "openaimmlu_computer_security",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_college_physics",
+ "openaimmlu_astronomy",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_econometrics",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_conceptual_physics"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu": 0,
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-adapted-7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "f2de64b06baedc5546928fbdea10fca517f7cbc7",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736968038.6495116,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-adapted-7b-chat",
+ "model_name_sanitized": "inceptionai__jais-adapted-7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 3227.626114991,
+ "end_time": 3509.415462885,
+ "total_evaluation_time_seconds": "281.789347894"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-13b-chat/acva_5_shot.json b/evaluations/ar/jais-family-13b-chat/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..c97a9b5e568c3cd1113e2e36c9ece86466120b47
--- /dev/null
+++ b/evaluations/ar/jais-family-13b-chat/acva_5_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7151549942594718,
+ "acc_stderr,none": 0.004836378115069638,
+ "acc_norm,none": 0.711825487944891,
+ "acc_norm_stderr,none": 0.004853224766783267
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-13b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 13027571240,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "0ef8b4f80429609890816d912b331d3b95864707",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736969414.0827904,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4160.406427698,
+ "end_time": 5672.598217492,
+ "total_evaluation_time_seconds": "1512.1917897940002"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-13b-chat/ar_ifeval_0_shot.json b/evaluations/ar/jais-family-13b-chat/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b012da0a475eb8bb437bea6e0a3686fd5ed7c0be
--- /dev/null
+++ b/evaluations/ar/jais-family-13b-chat/ar_ifeval_0_shot.json
@@ -0,0 +1,138 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.17164179104477612,
+ "prompt_level_strict_acc_stderr,none": 0.01630210620024172,
+ "inst_level_strict_acc,none": 0.5426621160409556,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.208955223880597,
+ "prompt_level_loose_acc_stderr,none": 0.017577222851338593,
+ "inst_level_loose_acc,none": 0.5870307167235495,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-family-13b-chat,tensor_parallel_size=4,data_parallel_size=2,download_dir=/tmp,enforce_eager=False",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738654510.3400126,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "ar_ifeval": "4b20e2959680620fd181f30d91c0274af9a3e1cc023b746ee5e02809d7d45954"
+ },
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-family-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 429194.858034011,
+ "end_time": 429654.537159294,
+ "total_evaluation_time_seconds": "459.67912528302986"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-13b-chat/araMath_v3_5_shot.json b/evaluations/ar/jais-family-13b-chat/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..646bf90ce6cde0bdfc4a6a234854dbbd2a35ab6a
--- /dev/null
+++ b/evaluations/ar/jais-family-13b-chat/araMath_v3_5_shot.json
@@ -0,0 +1,122 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.26611570247933886,
+ "acc_stderr,none": 0.017981693016247826,
+ "acc_norm,none": 0.26611570247933886,
+ "acc_norm_stderr,none": 0.017981693016247826
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-family-13b-chat,tensor_parallel_size=1,data_parallel_size=8,download_dir=/tmp,enforce_eager=False",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738675314.717633,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.86\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "araMath_v3": "b3fe722cebee19d37f6462a65a71854be30c8fada0a636e26fe49e070b49d07e"
+ },
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-family-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 529237.504818623,
+ "end_time": 529350.764209511,
+ "total_evaluation_time_seconds": "113.25939088803716"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-13b-chat/araPro_0_shot.json b/evaluations/ar/jais-family-13b-chat/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..5649e4d73bf050a5d15beb56d97853f259321c29
--- /dev/null
+++ b/evaluations/ar/jais-family-13b-chat/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.5752849430113978,
+ "acc_stderr,none": 0.00699045316636581,
+ "acc_norm,none": 0.5752849430113978,
+ "acc_norm_stderr,none": 0.00699045316636581
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-13b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 13027571240,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "0ef8b4f80429609890816d912b331d3b95864707",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738745497.5500338,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "araPro": "ecf84d12784310b52b252574c7d56efbe3005c09fb41c792c4fa6a74fcae7239"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 744617.512701132,
+ "end_time": 746248.251551348,
+ "total_evaluation_time_seconds": "1630.738850216032"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-13b-chat/arabicmmlu_0_shot.json b/evaluations/ar/jais-family-13b-chat/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..44329d0d977c21762441d6f1790a53b3f2cf86ea
--- /dev/null
+++ b/evaluations/ar/jais-family-13b-chat/arabicmmlu_0_shot.json
@@ -0,0 +1,2045 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.5813905223106192,
+ "acc_stderr,none": 0.003974457419363176,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.6207276736493936,
+ "acc_stderr,none": 0.007676866448419673,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.4605263157894737,
+ "acc_stderr,none": 0.01809220376192219
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.6167664670658682,
+ "acc_stderr,none": 0.026642195538092498
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.07647191129018725
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.6071987480438185,
+ "acc_stderr,none": 0.01933488200369804
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.6650246305418719,
+ "acc_stderr,none": 0.033208527423483104
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.6428571428571429,
+ "acc_stderr,none": 0.031124619309328177
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.6862745098039216,
+ "acc_stderr,none": 0.04617034827006718
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.8138138138138138,
+ "acc_stderr,none": 0.012321710081733966
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.3535031847133758,
+ "acc_stderr,none": 0.027021390361997532
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5595382746051033,
+ "acc_stderr,none": 0.011907567989279312,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.6748366013071896,
+ "acc_stderr,none": 0.018950886770806315
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.5287671232876713,
+ "acc_stderr,none": 0.02616370969480108
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.37435897435897436,
+ "acc_stderr,none": 0.024537591572830496
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.5185185185185185,
+ "acc_stderr,none": 0.09799078929868857
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.6150793650793651,
+ "acc_stderr,none": 0.03071243955075999
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.645330112721417,
+ "acc_stderr,none": 0.009605570074720063,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.6457473162675474,
+ "acc_stderr,none": 0.013749762426221467
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.6516203703703703,
+ "acc_stderr,none": 0.01621878455756233
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.6162790697674418,
+ "acc_stderr,none": 0.03718762118238795
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.6604938271604939,
+ "acc_stderr,none": 0.03732031330740126
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.6,
+ "acc_stderr,none": 0.05694947974514993
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.560216894977169,
+ "acc_stderr,none": 0.00821187595080662,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.4482758620689655,
+ "acc_stderr,none": 0.053627116270410544
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.5916666666666667,
+ "acc_stderr,none": 0.02594171859862409
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.4527938342967245,
+ "acc_stderr,none": 0.015457397136918143
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.4957627118644068,
+ "acc_stderr,none": 0.032615232401979485
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.7241379310344828,
+ "acc_stderr,none": 0.04819560289115228
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.6360294117647058,
+ "acc_stderr,none": 0.029227192460032025
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.4896265560165975,
+ "acc_stderr,none": 0.0322679143822933
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.7017543859649122,
+ "acc_stderr,none": 0.061134390564663986
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.7163120567375887,
+ "acc_stderr,none": 0.01698968161579803
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.5540540540540541,
+ "acc_stderr,none": 0.058177592923397636
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.5401459854014599,
+ "acc_stderr,none": 0.04273622067714666
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.5238095238095238,
+ "acc_stderr,none": 0.034546488100476766
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5214531788286878,
+ "acc_stderr,none": 0.008539561905594092,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.42086586231369766,
+ "acc_stderr,none": 0.013157097879519403
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.5478927203065134,
+ "acc_stderr,none": 0.030866105840801246
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.38823529411764707,
+ "acc_stderr,none": 0.03057897034303606
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.7777777777777778,
+ "acc_stderr,none": 0.08153326507837146
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.6735537190082644,
+ "acc_stderr,none": 0.030205321356519606
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.6894736842105263,
+ "acc_stderr,none": 0.03365713545671698
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.5134474327628362,
+ "acc_stderr,none": 0.024744734365196468
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.7767857142857143,
+ "acc_stderr,none": 0.022750408778833355
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.6875,
+ "acc_stderr,none": 0.058397074018894594
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.5813905223106192,
+ "acc_stderr,none": 0.003974457419363176,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.6207276736493936,
+ "acc_stderr,none": 0.007676866448419673,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5595382746051033,
+ "acc_stderr,none": 0.011907567989279312,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.645330112721417,
+ "acc_stderr,none": 0.009605570074720063,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.560216894977169,
+ "acc_stderr,none": 0.00821187595080662,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5214531788286878,
+ "acc_stderr,none": 0.008539561905594092,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_arabic_language_(general)",
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_middle_arabic_language"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_high_biology",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_middle_natural_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_high_history",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_high_islamic_studies"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_univ_accounting"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_univ_management",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_driving_test"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-family-13b-chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735755943.4155445,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-family-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 9749.039771719,
+ "end_time": 10388.251187622,
+ "total_evaluation_time_seconds": "639.2114159029989"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-13b-chat/etec_v2_0_shot.json b/evaluations/ar/jais-family-13b-chat/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..aad44c0b50bec7137515d4fa8436558214a2d9eb
--- /dev/null
+++ b/evaluations/ar/jais-family-13b-chat/etec_v2_0_shot.json
@@ -0,0 +1,122 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.4864864864864865,
+ "acc_stderr,none": 0.011509076711033886,
+ "acc_norm,none": 0.4864864864864865,
+ "acc_norm_stderr,none": 0.011509076711033886
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-family-13b-chat,tensor_parallel_size=1,data_parallel_size=8,download_dir=/tmp,enforce_eager=False",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738681928.5301642,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "etec_v2": "96d83c3dfc0ddb3d56ef40f620488675ad72862342308d216d4140d7d20ecd38"
+ },
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-family-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 1056788.20809773,
+ "end_time": 1057190.65877355,
+ "total_evaluation_time_seconds": "402.45067582000047"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-13b-chat/exams_ar_5_shot.json b/evaluations/ar/jais-family-13b-chat/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..7b3d8958c75c6eb5bed5d623c2909216b65c53d1
--- /dev/null
+++ b/evaluations/ar/jais-family-13b-chat/exams_ar_5_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.4506517690875233,
+ "acc_stderr,none": 0.021491266540407467,
+ "acc_norm,none": 0.4506517690875233,
+ "acc_norm_stderr,none": 0.021491266540407467
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-13b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 13027571240,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "0ef8b4f80429609890816d912b331d3b95864707",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 8
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737023418.5168922,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 3042.082462715,
+ "end_time": 4392.50396786,
+ "total_evaluation_time_seconds": "1350.4215051449996"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-13b-chat/gat_0_shot.json b/evaluations/ar/jais-family-13b-chat/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..9482260c5f1de2d5bf053fa6353b2a3438b02c58
--- /dev/null
+++ b/evaluations/ar/jais-family-13b-chat/gat_0_shot.json
@@ -0,0 +1,539 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.31719553493039004,
+ "acc_stderr,none": 0.0036673800264634595,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.3484230055658627,
+ "acc_stderr,none": 0.009179890200725068
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.2837887067395264,
+ "acc_stderr,none": 0.008606490293380746
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.25653294074346705,
+ "acc_stderr,none": 0.008379875233626235
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.39617224880382773,
+ "acc_stderr,none": 0.015137296245565172
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.28770491803278686,
+ "acc_stderr,none": 0.012965872987333184
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.3371900826446281,
+ "acc_stderr,none": 0.013596237583820002
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.27223926380368096,
+ "acc_stderr,none": 0.012330976880474218
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.3287671232876712,
+ "acc_stderr,none": 0.02462238450062787
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.3761814744801512,
+ "acc_stderr,none": 0.009421002319111672
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.31719553493039004,
+ "acc_stderr,none": 0.0036673800264634595,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-family-13b-chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735755270.1942198,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-family-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 9075.762825732,
+ "end_time": 9718.924999701,
+ "total_evaluation_time_seconds": "643.1621739689999"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-13b-chat/moe_ien_mcq_0_shot.json b/evaluations/ar/jais-family-13b-chat/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ed704e9225586f4c355ad01da6d08f02d260a011
--- /dev/null
+++ b/evaluations/ar/jais-family-13b-chat/moe_ien_mcq_0_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.6295295295295296,
+ "acc_stderr,none": 0.004831965726290136,
+ "acc_norm,none": 0.6295295295295296,
+ "acc_norm_stderr,none": 0.004831965726290136
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-family-13b-chat,tensor_parallel_size=1,data_parallel_size=8,download_dir=/tmp,enforce_eager=False",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738674575.1485074,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.86\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "moe_ien_mcq": "64c1f30e4acb02ea085279bfa8affcb9f9f8f00136eb0d89b2fd705e17950843"
+ },
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-family-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 528498.062782709,
+ "end_time": 528709.370624047,
+ "total_evaluation_time_seconds": "211.30784133798443"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-13b-chat/moe_ien_tf_0_shot.json b/evaluations/ar/jais-family-13b-chat/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..8fac443472d9acf69de3a6ffe8375431663504b6
--- /dev/null
+++ b/evaluations/ar/jais-family-13b-chat/moe_ien_tf_0_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.6867594023699124,
+ "acc_stderr,none": 0.006078623271522227,
+ "acc_norm,none": 0.6867594023699124,
+ "acc_norm_stderr,none": 0.006078623271522227
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-family-13b-chat,tensor_parallel_size=1,data_parallel_size=8,download_dir=/tmp,enforce_eager=False",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738682397.1412141,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "moe_ien_tf": "1b5f087aef767b97dbc9faaaacace59a2c0298137e4e95b34f3a681282d72c46"
+ },
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-family-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 1057256.697234494,
+ "end_time": 1057380.72616096,
+ "total_evaluation_time_seconds": "124.028926466126"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-13b-chat/openaimmlu_0_shot.json b/evaluations/ar/jais-family-13b-chat/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b1764844d9b949d441c208e5d841db07407d2d62
--- /dev/null
+++ b/evaluations/ar/jais-family-13b-chat/openaimmlu_0_shot.json
@@ -0,0 +1,2656 @@
+{
+ "results": {
+ "openaimmlu": {
+ "acc,none": 0.47728243839908846,
+ "acc_stderr,none": 0.004075228135853262,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.40066225165562913,
+ "acc_stderr,none": 0.008735985110676752,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.25,
+ "acc_stderr,none": 0.04351941398892446
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.5197368421052632,
+ "acc_stderr,none": 0.040657710025626036
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.5277777777777778,
+ "acc_stderr,none": 0.04174752578923185
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.34,
+ "acc_stderr,none": 0.04760952285695236
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.46,
+ "acc_stderr,none": 0.05009082659620333
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.29,
+ "acc_stderr,none": 0.04560480215720684
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.3333333333333333,
+ "acc_stderr,none": 0.04690650298201943
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.57,
+ "acc_stderr,none": 0.04975698519562428
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.3872340425531915,
+ "acc_stderr,none": 0.03184389265339526
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.3333333333333333,
+ "acc_stderr,none": 0.044346007015849245
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.4689655172413793,
+ "acc_stderr,none": 0.04158632762097828
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.3412698412698413,
+ "acc_stderr,none": 0.02441923496681907
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.5838709677419355,
+ "acc_stderr,none": 0.028040981380761543
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.4236453201970443,
+ "acc_stderr,none": 0.034767257476490364
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.49,
+ "acc_stderr,none": 0.05024183937956912
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.29259259259259257,
+ "acc_stderr,none": 0.02773896963217609
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.33112582781456956,
+ "acc_stderr,none": 0.038425817186598696
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.27314814814814814,
+ "acc_stderr,none": 0.030388051301678116
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.6003325942350333,
+ "acc_stderr,none": 0.011449323544037743,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.6909090909090909,
+ "acc_stderr,none": 0.036085410115739666
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.6323529411764706,
+ "acc_stderr,none": 0.03384132045674118
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.6835443037974683,
+ "acc_stderr,none": 0.03027497488021898
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.6446280991735537,
+ "acc_stderr,none": 0.0436923632657398
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.5555555555555556,
+ "acc_stderr,none": 0.04803752235190192
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.5521472392638037,
+ "acc_stderr,none": 0.03906947479456606
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.5530546623794212,
+ "acc_stderr,none": 0.028237769422085335
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.5061728395061729,
+ "acc_stderr,none": 0.027818623962583302
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.036155076303109344
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.48128792987188135,
+ "acc_stderr,none": 0.006333441327132957,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.4444444444444444,
+ "acc_stderr,none": 0.04292596718256981
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.5509433962264151,
+ "acc_stderr,none": 0.030612730713641095
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.49710982658959535,
+ "acc_stderr,none": 0.038124005659748335
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.35714285714285715,
+ "acc_stderr,none": 0.04285714285714281
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.32,
+ "acc_stderr,none": 0.04688261722621504
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.6515151515151515,
+ "acc_stderr,none": 0.033948539651564025
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.6220183486238532,
+ "acc_stderr,none": 0.020789187066728106
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.547085201793722,
+ "acc_stderr,none": 0.033408675019233246
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.375,
+ "acc_stderr,none": 0.04595091388086298
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.6,
+ "acc_stderr,none": 0.04923659639173309
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.6257982120051085,
+ "acc_stderr,none": 0.01730480507225203
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.5359477124183006,
+ "acc_stderr,none": 0.02855582751652878
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.37943262411347517,
+ "acc_stderr,none": 0.028947338851614105
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.3500651890482399,
+ "acc_stderr,none": 0.012182552313215175
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.4338235294117647,
+ "acc_stderr,none": 0.030105636570016633
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.4869281045751634,
+ "acc_stderr,none": 0.020220920829626912
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.4819277108433735,
+ "acc_stderr,none": 0.03889951252827216
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.472915398660986,
+ "acc_stderr,none": 0.008280814440523745,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.58,
+ "acc_stderr,none": 0.049604496374885836
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.694300518134715,
+ "acc_stderr,none": 0.033248379397581594
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.4846153846153846,
+ "acc_stderr,none": 0.025339003010106515
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.42436974789915966,
+ "acc_stderr,none": 0.032104790510157764
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.6106870229007634,
+ "acc_stderr,none": 0.04276486542814591
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.5825242718446602,
+ "acc_stderr,none": 0.048828405482122375
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.6196581196581197,
+ "acc_stderr,none": 0.03180425204384099
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.5433526011560693,
+ "acc_stderr,none": 0.026817718130348916
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.24022346368715083,
+ "acc_stderr,none": 0.014288343803925315
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.44545454545454544,
+ "acc_stderr,none": 0.047605488214603246
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.5836734693877551,
+ "acc_stderr,none": 0.03155782816556165
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.6218905472636815,
+ "acc_stderr,none": 0.034288678487786564
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.67,
+ "acc_stderr,none": 0.047258156262526094
+ }
+ },
+ "groups": {
+ "openaimmlu": {
+ "acc,none": 0.47728243839908846,
+ "acc_stderr,none": 0.004075228135853262,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.40066225165562913,
+ "acc_stderr,none": 0.008735985110676752,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.6003325942350333,
+ "acc_stderr,none": 0.011449323544037743,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.48128792987188135,
+ "acc_stderr,none": 0.006333441327132957,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.472915398660986,
+ "acc_stderr,none": 0.008280814440523745,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_philosophy",
+ "openaimmlu_international_law",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_prehistory",
+ "openaimmlu_world_religions",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_high_school_european_history"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_management",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_public_relations",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_security_studies",
+ "openaimmlu_business_ethics",
+ "openaimmlu_sociology",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_marketing"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_professional_law",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_machine_learning",
+ "openaimmlu_human_aging",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_anatomy",
+ "openaimmlu_college_medicine",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_global_facts",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_nutrition",
+ "openaimmlu_formal_logic",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_virology",
+ "openaimmlu_professional_accounting"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_college_mathematics",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_college_biology",
+ "openaimmlu_computer_security",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_college_physics",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_econometrics",
+ "openaimmlu_astronomy",
+ "openaimmlu_high_school_mathematics"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu": 0,
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=inceptionai/jais-family-13b-chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735754494.9131842,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "inceptionai/jais-family-13b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-13b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 8300.499232358,
+ "end_time": 9045.254644093,
+ "total_evaluation_time_seconds": "744.7554117349991"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-16k-chat/acva_5_shot.json b/evaluations/ar/jais-family-30b-16k-chat/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..4edf192c92c3ff719f242e5c8c7fc85c630b482f
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-16k-chat/acva_5_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.6070034443168771,
+ "acc_stderr,none": 0.005233663601030597,
+ "acc_norm,none": 0.6008036739380023,
+ "acc_norm_stderr,none": 0.005247777491288741
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 1.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-16k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "369f88eeee4d313155f1b1dca4ebec90f9f9f2a4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737023003.255661,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.9\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.3.107\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\nGPU 2: NVIDIA A100 80GB PCIe\nGPU 3: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (12 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.15.0rc2\n[pip3] open_clip_torch==2.26.1\n[pip3] optree==0.10.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.2.0a0\n[pip3] torchdata==0.7.0a0\n[pip3] torchdiffeq==0.2.4\n[pip3] torchmetrics==1.4.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.17.0a0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 16384,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-16k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-16k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 879299.652245392,
+ "end_time": 879911.507597097,
+ "total_evaluation_time_seconds": "611.8553517049877"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-16k-chat/ar_ifeval_0_shot.json b/evaluations/ar/jais-family-30b-16k-chat/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..951dcb4ea80eb12fbd59d644ab0cc87a26e2c815
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-16k-chat/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.166044776119403,
+ "prompt_level_strict_acc_stderr,none": 0.01608818620625759,
+ "inst_level_strict_acc,none": 0.5494880546075085,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.19402985074626866,
+ "prompt_level_loose_acc_stderr,none": 0.0170968799561458,
+ "inst_level_loose_acc,none": 0.5781569965870307,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-16k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "369f88eeee4d313155f1b1dca4ebec90f9f9f2a4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738753223.889612,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 16384,
+ "task_hashes": {
+ "ar_ifeval": "09fb0c6580f0a42624590f94c9483581a566f54a07cf60f59a60d159e4c054e2"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-16k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-16k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 129601.36739099,
+ "end_time": 136220.738703003,
+ "total_evaluation_time_seconds": "6619.371312013012"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-16k-chat/araMath_v3_5_shot.json b/evaluations/ar/jais-family-30b-16k-chat/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..9dca32f780af744611587d5ee9296aecccb8d962
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-16k-chat/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.41487603305785126,
+ "acc_stderr,none": 0.02004770429343817,
+ "acc_norm,none": 0.41487603305785126,
+ "acc_norm_stderr,none": 0.02004770429343817
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-16k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "369f88eeee4d313155f1b1dca4ebec90f9f9f2a4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738749362.5629075,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 16384,
+ "task_hashes": {
+ "araMath_v3": "d0d66a51e36e6cb52cf906fef452bc518aad1a1e641c82f522dc8014f42cc48e"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-16k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-16k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 125739.990717701,
+ "end_time": 125933.227370466,
+ "total_evaluation_time_seconds": "193.23665276500105"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-16k-chat/araPro_0_shot.json b/evaluations/ar/jais-family-30b-16k-chat/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..300854e394dbef3176441c39348ed6a8c61d4d72
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-16k-chat/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.627874425114977,
+ "acc_stderr,none": 0.006835907129291598,
+ "acc_norm,none": 0.627874425114977,
+ "acc_norm_stderr,none": 0.006835907129291598
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-16k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "369f88eeee4d313155f1b1dca4ebec90f9f9f2a4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738742453.9834554,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 16384,
+ "task_hashes": {
+ "araPro": "6801d81fb64458427c0b7638660f113d7777c17252b7552d3a623eccf14d861c"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-16k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-16k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 118831.218058398,
+ "end_time": 122448.367654043,
+ "total_evaluation_time_seconds": "3617.149595645009"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-16k-chat/arabicmmlu_0_shot.json b/evaluations/ar/jais-family-30b-16k-chat/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b1181bf06c2920b89aa3ef68423d2ecb62ca74d3
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-16k-chat/arabicmmlu_0_shot.json
@@ -0,0 +1,2051 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.6204081632653061,
+ "acc_stderr,none": 0.0039242758195679964,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.6664829106945975,
+ "acc_stderr,none": 0.007611297890057881,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.5092105263157894,
+ "acc_stderr,none": 0.018145770683067157
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.6736526946107785,
+ "acc_stderr,none": 0.02569424876081477
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.6153846153846154,
+ "acc_stderr,none": 0.07892141169885801
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.6291079812206573,
+ "acc_stderr,none": 0.019123879653915377
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.6945812807881774,
+ "acc_stderr,none": 0.032406615658684086
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.7100840336134454,
+ "acc_stderr,none": 0.029472485833136098
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.6764705882352942,
+ "acc_stderr,none": 0.0465501041131961
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.8188188188188188,
+ "acc_stderr,none": 0.01219228709045048
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.5828025477707006,
+ "acc_stderr,none": 0.02787143797110679
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6148238153098421,
+ "acc_stderr,none": 0.011655671594931498,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.7254901960784313,
+ "acc_stderr,none": 0.018054027458815198
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.6054794520547945,
+ "acc_stderr,none": 0.0256173278621582
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.4205128205128205,
+ "acc_stderr,none": 0.025028610276710855
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.5925925925925926,
+ "acc_stderr,none": 0.09636202008710973
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.6626984126984127,
+ "acc_stderr,none": 0.02984216291210435
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6618357487922706,
+ "acc_stderr,none": 0.009495029305656414,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.6507018992568125,
+ "acc_stderr,none": 0.013705549867019138
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.6631944444444444,
+ "acc_stderr,none": 0.016088096594397746
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.6627906976744186,
+ "acc_stderr,none": 0.036152631988716356
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.7160493827160493,
+ "acc_stderr,none": 0.03553693417920618
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.7066666666666667,
+ "acc_stderr,none": 0.05292637528870839
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6070205479452054,
+ "acc_stderr,none": 0.00810821047606248,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.45977011494252873,
+ "acc_stderr,none": 0.053741581963657706
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.5972222222222222,
+ "acc_stderr,none": 0.02588531808222096
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.5433526011560693,
+ "acc_stderr,none": 0.01546827879763711
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.5169491525423728,
+ "acc_stderr,none": 0.03259765859155325
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.7816091954022989,
+ "acc_stderr,none": 0.044551545932103705
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.6691176470588235,
+ "acc_stderr,none": 0.028582709753898445
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5311203319502075,
+ "acc_stderr,none": 0.03221228576046391
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.7368421052631579,
+ "acc_stderr,none": 0.058843894144731304
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.7574468085106383,
+ "acc_stderr,none": 0.016154489454265293
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.5675675675675675,
+ "acc_stderr,none": 0.057983774751431016
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.5328467153284672,
+ "acc_stderr,none": 0.04278203076713147
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.5571428571428572,
+ "acc_stderr,none": 0.03435911486831027
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5533980582524272,
+ "acc_stderr,none": 0.008425372356576838,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.4350603264726757,
+ "acc_stderr,none": 0.013212179051376388
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.6360153256704981,
+ "acc_stderr,none": 0.02983930237266775
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.4196078431372549,
+ "acc_stderr,none": 0.030964616656831888
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.8888888888888888,
+ "acc_stderr,none": 0.06163335513613659
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.7355371900826446,
+ "acc_stderr,none": 0.028410318393787815
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.6684210526315789,
+ "acc_stderr,none": 0.0342442478876195
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.5574572127139364,
+ "acc_stderr,none": 0.024589705158305858
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.8363095238095238,
+ "acc_stderr,none": 0.020214957089599826
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.671875,
+ "acc_stderr,none": 0.05915529526875285
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.6204081632653061,
+ "acc_stderr,none": 0.0039242758195679964,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.6664829106945975,
+ "acc_stderr,none": 0.007611297890057881,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6148238153098421,
+ "acc_stderr,none": 0.011655671594931498,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6618357487922706,
+ "acc_stderr,none": 0.009495029305656414,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6070205479452054,
+ "acc_stderr,none": 0.00810821047606248,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5533980582524272,
+ "acc_stderr,none": 0.008425372356576838,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_arabic_language_(general)",
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_primary_math",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_middle_computer_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_middle_history",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_prof_law"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_univ_political_science"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_driving_test",
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_univ_management",
+ "arabicmmlu_primary_general_knowledge"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-16k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "369f88eeee4d313155f1b1dca4ebec90f9f9f2a4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 1
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735995272.1049664,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 16384,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-16k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-16k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 11146.797419869,
+ "end_time": 13802.445754899,
+ "total_evaluation_time_seconds": "2655.6483350299986"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-16k-chat/etec_v2_0_shot.json b/evaluations/ar/jais-family-30b-16k-chat/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..18154b9a7cbae2a55c88122ff4b2289984aea0ca
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-16k-chat/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.5331213566507684,
+ "acc_stderr,none": 0.01148799400336844,
+ "acc_norm,none": 0.5331213566507684,
+ "acc_norm_stderr,none": 0.01148799400336844
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-16k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "369f88eeee4d313155f1b1dca4ebec90f9f9f2a4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738746335.5654905,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 16384,
+ "task_hashes": {
+ "etec_v2": "d74045de4716b9652a4bfefbbb9f15b8700f98c226ac24538bb01ca5e0c7c2b2"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-16k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-16k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 122713.028312008,
+ "end_time": 122969.654428848,
+ "total_evaluation_time_seconds": "256.62611684000876"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-16k-chat/exams_ar_5_shot.json b/evaluations/ar/jais-family-30b-16k-chat/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..eee22d6ba3e05dd16efd0a3f5e87bd4ec5095bec
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-16k-chat/exams_ar_5_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.4972067039106145,
+ "acc_stderr,none": 0.021596373620103398,
+ "acc_norm,none": 0.4972067039106145,
+ "acc_norm_stderr,none": 0.021596373620103398
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-16k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "369f88eeee4d313155f1b1dca4ebec90f9f9f2a4",
+ "batch_size": "1",
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737046313.960676,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.9\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.3.107\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\nGPU 2: NVIDIA A100 80GB PCIe\nGPU 3: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (12 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.15.0rc2\n[pip3] open_clip_torch==2.26.1\n[pip3] optree==0.10.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.2.0a0\n[pip3] torchdata==0.7.0a0\n[pip3] 
torchdiffeq==0.2.4\n[pip3] torchmetrics==1.4.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.17.0a0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 16384,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-16k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-16k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 902609.9677068,
+ "end_time": 902779.302453321,
+ "total_evaluation_time_seconds": "169.3347465210827"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-16k-chat/gat_0_shot.json b/evaluations/ar/jais-family-30b-16k-chat/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b2708e393f5a8102f99b2865a6f7873ed99df490
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-16k-chat/gat_0_shot.json
@@ -0,0 +1,545 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.3484886491910197,
+ "acc_stderr,none": 0.0037194291415010767,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.2727272727272727,
+ "acc_stderr,none": 0.008580530512418336
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.3242258652094718,
+ "acc_stderr,none": 0.008935781854640976
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.2800883327199117,
+ "acc_stderr,none": 0.00861632818616305
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.4717703349282297,
+ "acc_stderr,none": 0.015449927959569091
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.3081967213114754,
+ "acc_stderr,none": 0.013225236964535328
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.46859504132231405,
+ "acc_stderr,none": 0.014351539649046162
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.2967791411042945,
+ "acc_stderr,none": 0.012655821799091272
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.3232876712328767,
+ "acc_stderr,none": 0.024515791774351408
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.46502835538752363,
+ "acc_stderr,none": 0.00970005895596934
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.3484886491910197,
+ "acc_stderr,none": 0.0037194291415010767,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-16k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "369f88eeee4d313155f1b1dca4ebec90f9f9f2a4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 4
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735804631.9752336,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 16384,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-16k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-16k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 38043.362871866,
+ "end_time": 39852.631370652,
+ "total_evaluation_time_seconds": "1809.2684987860048"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-16k-chat/moe_ien_mcq_0_shot.json b/evaluations/ar/jais-family-30b-16k-chat/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..8440fc7b1d8eade8218f9afa80ce6c619d5abe67
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-16k-chat/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.7488488488488488,
+ "acc_stderr,none": 0.0043391400060673,
+ "acc_norm,none": 0.7488488488488488,
+ "acc_norm_stderr,none": 0.0043391400060673
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-16k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "369f88eeee4d313155f1b1dca4ebec90f9f9f2a4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738746670.4129548,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 16384,
+ "task_hashes": {
+ "moe_ien_mcq": "10880f503e175cc1732ea242e62a05f551ab3037c2343137caef8ccae9b636d6"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-16k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-16k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 123047.830779962,
+ "end_time": 123936.794338963,
+ "total_evaluation_time_seconds": "888.9635590010002"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-16k-chat/moe_ien_tf_0_shot.json b/evaluations/ar/jais-family-30b-16k-chat/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a6a9b53b7a36c76d487c2d85a5893d2914cbec5b
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-16k-chat/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.6876180662888546,
+ "acc_stderr,none": 0.006074079799796524,
+ "acc_norm,none": 0.6876180662888546,
+ "acc_norm_stderr,none": 0.006074079799796524
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-16k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "369f88eeee4d313155f1b1dca4ebec90f9f9f2a4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738747625.6598117,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 16384,
+ "task_hashes": {
+ "moe_ien_tf": "944b34dde7f12f68b21e22312c06a9cdc68419df98db10d8e947f07ff8680ed0"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-16k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-16k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 124003.170151918,
+ "end_time": 124544.441198311,
+ "total_evaluation_time_seconds": "541.271046392998"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-16k-chat/openaimmlu_0_shot.json b/evaluations/ar/jais-family-30b-16k-chat/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..07d50ef94c753fa1c5390d9d9868b0cd743d0bdd
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-16k-chat/openaimmlu_0_shot.json
@@ -0,0 +1,2662 @@
+{
+ "results": {
+ "openaimmlu": {
+ "acc,none": 0.5097564449508617,
+ "acc_stderr,none": 0.004024556823322554,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.42549668874172186,
+ "acc_stderr,none": 0.008775212636298942,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.3,
+ "acc_stderr,none": 0.046056618647183814
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.5197368421052632,
+ "acc_stderr,none": 0.04065771002562605
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.5763888888888888,
+ "acc_stderr,none": 0.041321250197233685
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.41,
+ "acc_stderr,none": 0.049431107042371025
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.42,
+ "acc_stderr,none": 0.049604496374885836
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.26,
+ "acc_stderr,none": 0.04408440022768078
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.27450980392156865,
+ "acc_stderr,none": 0.04440521906179326
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.68,
+ "acc_stderr,none": 0.04688261722621505
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.4297872340425532,
+ "acc_stderr,none": 0.03236214467715564
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.2982456140350877,
+ "acc_stderr,none": 0.04303684033537316
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.46206896551724136,
+ "acc_stderr,none": 0.041546596717075474
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.36772486772486773,
+ "acc_stderr,none": 0.024833839825562413
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.6290322580645161,
+ "acc_stderr,none": 0.027480541887953593
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.43842364532019706,
+ "acc_stderr,none": 0.03491207857486518
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.48,
+ "acc_stderr,none": 0.050211673156867795
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.2962962962962963,
+ "acc_stderr,none": 0.02784081149587192
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.39072847682119205,
+ "acc_stderr,none": 0.039837983066598075
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.35185185185185186,
+ "acc_stderr,none": 0.03256850570293648
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.655210643015521,
+ "acc_stderr,none": 0.01099578815242949,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.793939393939394,
+ "acc_stderr,none": 0.0315841532404771
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.6617647058823529,
+ "acc_stderr,none": 0.03320574612945431
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.7763713080168776,
+ "acc_stderr,none": 0.027123298205229966
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.7024793388429752,
+ "acc_stderr,none": 0.04173349148083498
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.5555555555555556,
+ "acc_stderr,none": 0.04803752235190193
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.6196319018404908,
+ "acc_stderr,none": 0.038142698932618374
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.6045016077170418,
+ "acc_stderr,none": 0.027770918531427834
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.5246913580246914,
+ "acc_stderr,none": 0.02778680093142745
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.7485380116959064,
+ "acc_stderr,none": 0.033275044238468436
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.5028658125421444,
+ "acc_stderr,none": 0.006273334147065933,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.48148148148148145,
+ "acc_stderr,none": 0.043163785995113245
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.5471698113207547,
+ "acc_stderr,none": 0.03063562795796182
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.4508670520231214,
+ "acc_stderr,none": 0.037940126746970296
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.3492063492063492,
+ "acc_stderr,none": 0.04263906892795132
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.39,
+ "acc_stderr,none": 0.04902071300001975
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.7070707070707071,
+ "acc_stderr,none": 0.032424979581788166
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.6568807339449542,
+ "acc_stderr,none": 0.02035477773608604
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.6143497757847534,
+ "acc_stderr,none": 0.03266842214289201
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.375,
+ "acc_stderr,none": 0.04595091388086298
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.64,
+ "acc_stderr,none": 0.04824181513244218
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.669220945083014,
+ "acc_stderr,none": 0.01682481846256375
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.6209150326797386,
+ "acc_stderr,none": 0.027780141207023327
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.3900709219858156,
+ "acc_stderr,none": 0.02909767559946393
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.3513689700130378,
+ "acc_stderr,none": 0.01219296945748402
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.4007352941176471,
+ "acc_stderr,none": 0.029768263528933105
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.49019607843137253,
+ "acc_stderr,none": 0.020223946005074305
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.5963855421686747,
+ "acc_stderr,none": 0.038194861407583984
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.519780888618381,
+ "acc_stderr,none": 0.008126248479718141,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.6,
+ "acc_stderr,none": 0.049236596391733084
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.7150259067357513,
+ "acc_stderr,none": 0.03257714077709661
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.4564102564102564,
+ "acc_stderr,none": 0.02525448542479961
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.47478991596638653,
+ "acc_stderr,none": 0.0324371805513741
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.6564885496183206,
+ "acc_stderr,none": 0.041649760719448786
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.6310679611650486,
+ "acc_stderr,none": 0.0477761518115674
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.7692307692307693,
+ "acc_stderr,none": 0.027601921381417597
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.630057803468208,
+ "acc_stderr,none": 0.02599247202930637
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.2581005586592179,
+ "acc_stderr,none": 0.014635185616527829
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.6272727272727273,
+ "acc_stderr,none": 0.04631381319425465
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.6571428571428571,
+ "acc_stderr,none": 0.030387262919547724
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.6517412935323383,
+ "acc_stderr,none": 0.03368787466115459
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.78,
+ "acc_stderr,none": 0.041633319989322605
+ }
+ },
+ "groups": {
+ "openaimmlu": {
+ "acc,none": 0.5097564449508617,
+ "acc_stderr,none": 0.004024556823322554,
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.42549668874172186,
+ "acc_stderr,none": 0.008775212636298942,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.655210643015521,
+ "acc_stderr,none": 0.01099578815242949,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.5028658125421444,
+ "acc_stderr,none": 0.006273334147065933,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.519780888618381,
+ "acc_stderr,none": 0.008126248479718141,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_jurisprudence",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_international_law",
+ "openaimmlu_world_religions",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_prehistory",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_philosophy"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_sociology",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_marketing",
+ "openaimmlu_security_studies",
+ "openaimmlu_business_ethics",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_management",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_public_relations"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_anatomy",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_professional_law",
+ "openaimmlu_virology",
+ "openaimmlu_human_aging",
+ "openaimmlu_global_facts",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_machine_learning",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_college_medicine",
+ "openaimmlu_formal_logic",
+ "openaimmlu_nutrition",
+ "openaimmlu_medical_genetics"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_econometrics",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_astronomy",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_college_biology",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_computer_security",
+ "openaimmlu_college_physics",
+ "openaimmlu_conceptual_physics"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu": 0,
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-16k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "369f88eeee4d313155f1b1dca4ebec90f9f9f2a4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735802966.5463448,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 16384,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-16k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-16k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 36377.913584311,
+ "end_time": 38003.487732411,
+ "total_evaluation_time_seconds": "1625.5741481000005"
+}
\ No newline at end of file
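The `process_docs` string repeated in every `openaimmlu_*` config above renders each question with Arabic choice letters and maps the English answer key to a gold index. The sketch below restates that logic as standalone Python; `EN2AR` and `PROMPT` are illustrative stand-ins for the task module's `en2ar` and `PROMPT`, which are not included in this config dump.

```python
# Minimal sketch of the openaimmlu doc-processing logic shown above.
# EN2AR and PROMPT are assumptions: the real definitions live in the task module.
EN2AR = {"A": "أ", "B": "ب", "C": "ج", "D": "د"}
PROMPT = "فيما يلي أسئلة الاختيار من متعدد حول {}\n\n{}\n{}"  # hypothetical template

def process_doc(doc: dict, subject_ar: str) -> dict:
    keys_en = ["A", "B", "C", "D"]
    keys_ar = [EN2AR[k] for k in keys_en]
    # Render each option as "<Arabic letter>. <option text>".
    options = [f"{EN2AR[k]}. {doc[k]}" for k in keys_en]
    query = PROMPT.format(subject_ar, doc["Question"], "\n".join(options))
    # The gold index is the position of the Arabic letter for the English answer key.
    return {"query": query, "choices": keys_ar, "gold": keys_ar.index(EN2AR[doc["Answer"]])}

example = {"Question": "٢ + ٢ = ؟", "A": "٣", "B": "٤", "C": "٥", "D": "٦", "Answer": "B"}
print(process_doc(example, "الرياضيات")["gold"])  # -> 1
```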
diff --git a/evaluations/ar/jais-family-30b-8k-chat/acva_5_shot.json b/evaluations/ar/jais-family-30b-8k-chat/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a93b912ec512eb93ba0cc209e989b369bec8f362
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-8k-chat/acva_5_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7522388059701492,
+ "acc_stderr,none": 0.004626050445211006,
+ "acc_norm,none": 0.7446613088404134,
+ "acc_norm_stderr,none": 0.004672545760635334
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 1.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-8k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "dab185164dd3b79ec9201d7f4cf878ce91ae7e14",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737022392.8575761,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.9\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.3.107\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\nGPU 2: NVIDIA A100 80GB PCIe\nGPU 3: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (12 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.15.0rc2\n[pip3] open_clip_torch==2.26.1\n[pip3] optree==0.10.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.2.0a0\n[pip3] torchdata==0.7.0a0\n[pip3] 
torchdiffeq==0.2.4\n[pip3] torchmetrics==1.4.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.17.0a0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-8k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-8k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 878688.97735783,
+ "end_time": 879286.125326537,
+ "total_evaluation_time_seconds": "597.1479687069077"
+}
\ No newline at end of file
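Each result file above follows the same layout: top-level `results` keyed by task, with metrics stored under `acc,none` / `acc_stderr,none` (plus `acc_norm,none` where reported). A minimal sketch for reading one of these files with the standard library, assuming only the structure shown above (the path is illustrative):

```python
import json

# Load one evaluation report and print each task's accuracy with its standard error.
with open("evaluations/ar/jais-family-30b-8k-chat/acva_5_shot.json") as f:
    report = json.load(f)

for task, metrics in report["results"].items():
    acc = metrics["acc,none"]
    err = metrics["acc_stderr,none"]
    print(f"{task}: acc = {acc:.4f} ± {err:.4f}")  # acva: acc = 0.7522 ± 0.0046
```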
diff --git a/evaluations/ar/jais-family-30b-8k-chat/ar_ifeval_0_shot.json b/evaluations/ar/jais-family-30b-8k-chat/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a00c256a9aedfc643e64882432378b205eba86c8
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-8k-chat/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.16791044776119404,
+ "prompt_level_strict_acc_stderr,none": 0.016160210122502155,
+ "inst_level_strict_acc,none": 0.5467576791808874,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.1921641791044776,
+ "prompt_level_loose_acc_stderr,none": 0.017034166182138526,
+ "inst_level_loose_acc,none": 0.5733788395904437,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-8k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "dab185164dd3b79ec9201d7f4cf878ce91ae7e14",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738753006.465129,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 8192,
+ "task_hashes": {
+ "ar_ifeval": "09fb0c6580f0a42624590f94c9483581a566f54a07cf60f59a60d159e4c054e2"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-8k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-8k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 752127.533815689,
+ "end_time": 758558.307581761,
+ "total_evaluation_time_seconds": "6430.773766072001"
+}
\ No newline at end of file
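The `ar_ifeval` config above reports two granularities: prompt-level accuracy (all instructions in a prompt satisfied) is a plain mean, while instruction-level accuracy flattens the per-prompt boolean lists before averaging, exactly as in the `agg_inst_level_acc` aggregation string embedded in the config. Reproduced here as a standalone function with a small example:

```python
# Instruction-level aggregation from the ar_ifeval config: flatten the per-prompt
# lists of per-instruction pass/fail booleans, then take the overall mean.
def agg_inst_level_acc(items):
    flat_items = [item for sublist in items for item in sublist]
    return sum(flat_items) / len(flat_items)

# Example: two prompts, one with three instructions and one with two.
# 4 of 5 instructions pass, so instruction-level accuracy is 0.8, even though
# prompt-level strict accuracy would be 0.5 (only the second prompt passes fully).
print(agg_inst_level_acc([[True, False, True], [True, True]]))  # 0.8
```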
diff --git a/evaluations/ar/jais-family-30b-8k-chat/araMath_v3_5_shot.json b/evaluations/ar/jais-family-30b-8k-chat/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..8720a857b9258b9a12202513c29905f3e93b1cc0
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-8k-chat/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.3338842975206612,
+ "acc_stderr,none": 0.01918908929564786,
+ "acc_norm,none": 0.3338842975206612,
+ "acc_norm_stderr,none": 0.01918908929564786
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-8k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "dab185164dd3b79ec9201d7f4cf878ce91ae7e14",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738749227.274373,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 8192,
+ "task_hashes": {
+ "araMath_v3": "d0d66a51e36e6cb52cf906fef452bc518aad1a1e641c82f522dc8014f42cc48e"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-8k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-8k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 748348.274887979,
+ "end_time": 748521.714000069,
+ "total_evaluation_time_seconds": "173.43911208992358"
+}
\ No newline at end of file
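The `araMath_v3` config above strips any `(A) `-style prefix from each option before relabeling the choices `A.`–`D.`. A standalone restatement of that formatting step from its `process_docs`, kept minimal for illustration:

```python
# Standalone sketch of araMath_v3's option formatting (see process_docs above):
# drop a leading "(A) ".."(D) " marker if present, then relabel as "A. ".."D. ".
def remove_prefix(choice: str) -> str:
    for prefix in ("(A)", "(B)", "(C)", "(D)"):
        if choice.startswith(prefix + " "):
            return choice[len(prefix) + 1:]
    return choice

def format_options(options: list) -> str:
    keys = ["A", "B", "C", "D"]
    return "".join(f"{key}. {remove_prefix(choice)}\n" for key, choice in zip(keys, options))

print(format_options(["(A) 12", "(B) 15", "(C) 18", "(D) 21"]))
# A. 12
# B. 15
# C. 18
# D. 21
```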
diff --git a/evaluations/ar/jais-family-30b-8k-chat/araPro_0_shot.json b/evaluations/ar/jais-family-30b-8k-chat/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a2836d199a391cd704667a6c004275f1f34a7cbf
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-8k-chat/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.6126774645070986,
+ "acc_stderr,none": 0.0068891768592808725,
+ "acc_norm,none": 0.6126774645070986,
+ "acc_norm_stderr,none": 0.0068891768592808725
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-8k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "dab185164dd3b79ec9201d7f4cf878ce91ae7e14",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738742520.3000932,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 8192,
+ "task_hashes": {
+ "araPro": "6801d81fb64458427c0b7638660f113d7777c17252b7552d3a623eccf14d861c"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-8k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-8k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 741641.463589287,
+ "end_time": 745157.252657071,
+ "total_evaluation_time_seconds": "3515.789067783975"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-8k-chat/arabicmmlu_0_shot.json b/evaluations/ar/jais-family-30b-8k-chat/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..97db76a4080bd4c6d520fc64ca7ae5c1903c944f
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-8k-chat/arabicmmlu_0_shot.json
@@ -0,0 +1,2051 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.6311310965063992,
+ "acc_stderr,none": 0.003915956721287854,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.6714443219404631,
+ "acc_stderr,none": 0.007626754166189928,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.531578947368421,
+ "acc_stderr,none": 0.018112616894172776
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.6736526946107785,
+ "acc_stderr,none": 0.02569424876081477
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.6410256410256411,
+ "acc_stderr,none": 0.07781756136754926
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.6416275430359938,
+ "acc_stderr,none": 0.01898446977296123
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.6995073891625616,
+ "acc_stderr,none": 0.03225799476233485
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.7058823529411765,
+ "acc_stderr,none": 0.02959732973097811
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.6862745098039216,
+ "acc_stderr,none": 0.04617034827006719
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.8078078078078078,
+ "acc_stderr,none": 0.012472589323047442
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.589171974522293,
+ "acc_stderr,none": 0.02780858573833121
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6269744835965978,
+ "acc_stderr,none": 0.011579557089948563,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.7369281045751634,
+ "acc_stderr,none": 0.017812676542320657
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.5780821917808219,
+ "acc_stderr,none": 0.025885587833598424
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.4461538461538462,
+ "acc_stderr,none": 0.02520357177302833
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.7777777777777778,
+ "acc_stderr,none": 0.08153326507837146
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.6944444444444444,
+ "acc_stderr,none": 0.02907548617844108
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6827697262479872,
+ "acc_stderr,none": 0.009332799025507354,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.6655656482246077,
+ "acc_stderr,none": 0.013563076277979228
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.6805555555555556,
+ "acc_stderr,none": 0.015871722574177006
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.7267441860465116,
+ "acc_stderr,none": 0.034078261673374376
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.7469135802469136,
+ "acc_stderr,none": 0.034265467459005515
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.7466666666666667,
+ "acc_stderr,none": 0.05055844297598725
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6073059360730594,
+ "acc_stderr,none": 0.008116425662399026,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.47126436781609193,
+ "acc_stderr,none": 0.05382727149237504
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.5722222222222222,
+ "acc_stderr,none": 0.02611224702350195
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.5211946050096339,
+ "acc_stderr,none": 0.015512796494523768
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.5720338983050848,
+ "acc_stderr,none": 0.032276143452228304
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.7011494252873564,
+ "acc_stderr,none": 0.049360904959780114
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.6838235294117647,
+ "acc_stderr,none": 0.028245687391462927
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5435684647302904,
+ "acc_stderr,none": 0.0321520987444214
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.7192982456140351,
+ "acc_stderr,none": 0.060045857397047285
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.7546099290780142,
+ "acc_stderr,none": 0.016218228731984394
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.5945945945945946,
+ "acc_stderr,none": 0.05746373039227156
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.5766423357664233,
+ "acc_stderr,none": 0.04236795684728882
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.6238095238095238,
+ "acc_stderr,none": 0.03350863645112521
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5734419041653618,
+ "acc_stderr,none": 0.008456089718778688,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.46699787083037614,
+ "acc_stderr,none": 0.013295987397473433
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.5900383141762452,
+ "acc_stderr,none": 0.030501771826233554
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.47058823529411764,
+ "acc_stderr,none": 0.03131846503821582
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.8148148148148148,
+ "acc_stderr,none": 0.07618086585254093
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.731404958677686,
+ "acc_stderr,none": 0.02855087510553791
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.7421052631578947,
+ "acc_stderr,none": 0.031821679205643966
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.5819070904645477,
+ "acc_stderr,none": 0.024419296278041777
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.8273809523809523,
+ "acc_stderr,none": 0.020647844166180294
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.671875,
+ "acc_stderr,none": 0.05915529526875285
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.6311310965063992,
+ "acc_stderr,none": 0.003915956721287854,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.6714443219404631,
+ "acc_stderr,none": 0.007626754166189928,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.6269744835965978,
+ "acc_stderr,none": 0.011579557089948563,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6827697262479872,
+ "acc_stderr,none": 0.009332799025507354,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.6073059360730594,
+ "acc_stderr,none": 0.008116425662399026,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5734419041653618,
+ "acc_stderr,none": 0.008456089718778688,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_primary_arabic_language",
+ "arabicmmlu_arabic_language_(general)"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_middle_natural_science",
+ "arabicmmlu_high_biology"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_middle_history",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_high_philosophy",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_middle_islamic_studies"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_univ_accounting",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_primary_social_science"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_middle_general_knowledge",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_univ_management"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-8k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "dab185164dd3b79ec9201d7f4cf878ce91ae7e14",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 1
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1736967874.5336635,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.9\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.3.107\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\nGPU 2: NVIDIA A100 80GB PCIe\nGPU 3: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (12 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.15.0rc2\n[pip3] open_clip_torch==2.26.1\n[pip3] optree==0.10.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.2.0a0\n[pip3] torchdata==0.7.0a0\n[pip3] 
torchdiffeq==0.2.4\n[pip3] torchmetrics==1.4.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.17.0a0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.44.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-8k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-8k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 824172.012803095,
+ "end_time": 825725.137463907,
+ "total_evaluation_time_seconds": "1553.124660811969"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-8k-chat/etec_v2_0_shot.json b/evaluations/ar/jais-family-30b-8k-chat/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..883f1641689c93f3cfa85f6eda38c3d18ee93536
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-8k-chat/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.5352411234764176,
+ "acc_stderr,none": 0.011484649333613872,
+ "acc_norm,none": 0.5352411234764176,
+ "acc_norm_stderr,none": 0.011484649333613872
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-8k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "dab185164dd3b79ec9201d7f4cf878ce91ae7e14",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738746289.8466635,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 8192,
+ "task_hashes": {
+ "etec_v2": "d74045de4716b9652a4bfefbbb9f15b8700f98c226ac24538bb01ca5e0c7c2b2"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-8k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-8k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 745410.928285038,
+ "end_time": 745645.171704659,
+ "total_evaluation_time_seconds": "234.24341962102335"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-8k-chat/exams_ar_5_shot.json b/evaluations/ar/jais-family-30b-8k-chat/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b3d55f86d90c475e5bcd1940219c3ae71b8ee31a
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-8k-chat/exams_ar_5_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.5027932960893855,
+ "acc_stderr,none": 0.02159637362010341,
+ "acc_norm,none": 0.5027932960893855,
+ "acc_norm_stderr,none": 0.02159637362010341
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-8k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "dab185164dd3b79ec9201d7f4cf878ce91ae7e14",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 8
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b4b2b49c",
+ "date": 1737019753.2507129,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.9\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.3.107\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\nGPU 2: NVIDIA A100 80GB PCIe\nGPU 3: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (12 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.15.0rc2\n[pip3] open_clip_torch==2.26.1\n[pip3] optree==0.10.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.2.0a0\n[pip3] torchdata==0.7.0a0\n[pip3] torchdiffeq==0.2.4\n[pip3] torchmetrics==1.4.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.17.0a0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.44.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-8k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-8k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 876049.600112476,
+ "end_time": 876201.430001535,
+ "total_evaluation_time_seconds": "151.82988905895036"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-8k-chat/gat_0_shot.json b/evaluations/ar/jais-family-30b-8k-chat/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a66e1b59930be169466706a51f610a4352789826
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-8k-chat/gat_0_shot.json
@@ -0,0 +1,543 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.36435469710272167,
+ "acc_stderr,none": 0.0037275134732835647,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.2920222634508349,
+ "acc_stderr,none": 0.008760300143927015
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.35774134790528234,
+ "acc_stderr,none": 0.009150556306755668
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.30180345969819655,
+ "acc_stderr,none": 0.00880817775509723
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.48899521531100476,
+ "acc_stderr,none": 0.015470862946219716
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.21967213114754097,
+ "acc_stderr,none": 0.011858347905544155
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.5173553719008265,
+ "acc_stderr,none": 0.014371267374310048
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.28297546012269936,
+ "acc_stderr,none": 0.012478695554449207
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.273972602739726,
+ "acc_stderr,none": 0.023376494233709254
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.5092627599243856,
+ "acc_stderr,none": 0.009722204284872768
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.36435469710272167,
+ "acc_stderr,none": 0.0037275134732835647,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=inceptionai/jais-family-30b-8k-chat,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "dab185164dd3b79ec9201d7f4cf878ce91ae7e14",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1731336532.5150154,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.38.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-8k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-8k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 122997.247660745,
+ "end_time": 128873.09139221,
+ "total_evaluation_time_seconds": "5875.843731465007"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-8k-chat/moe_ien_mcq_0_shot.json b/evaluations/ar/jais-family-30b-8k-chat/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..1e3f872505aaeb2c60e73edda4fcfbebb9e3bf30
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-8k-chat/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.7276276276276277,
+ "acc_stderr,none": 0.004454255352343356,
+ "acc_norm,none": 0.7276276276276277,
+ "acc_norm_stderr,none": 0.004454255352343356
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-8k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "dab185164dd3b79ec9201d7f4cf878ce91ae7e14",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738746600.1540549,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 8192,
+ "task_hashes": {
+ "moe_ien_mcq": "10880f503e175cc1732ea242e62a05f551ab3037c2343137caef8ccae9b636d6"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-8k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-8k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 745721.017381925,
+ "end_time": 746587.515954665,
+ "total_evaluation_time_seconds": "866.4985727400053"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-8k-chat/moe_ien_tf_0_shot.json b/evaluations/ar/jais-family-30b-8k-chat/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..db74dbfbdd16fb3cb1b13744fdf8b5faa8b18e4e
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-8k-chat/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.7065086725055814,
+ "acc_stderr,none": 0.005967882782201126,
+ "acc_norm,none": 0.7065086725055814,
+ "acc_norm_stderr,none": 0.005967882782201126
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-30b-8k-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "dab185164dd3b79ec9201d7f4cf878ce91ae7e14",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738747536.6007946,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 8192,
+ "task_hashes": {
+ "moe_ien_tf": "944b34dde7f12f68b21e22312c06a9cdc68419df98db10d8e947f07ff8680ed0"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-8k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-8k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 746657.561119232,
+ "end_time": 747176.179915832,
+ "total_evaluation_time_seconds": "518.6187966000289"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-30b-8k-chat/openaimmlu_0_shot.json b/evaluations/ar/jais-family-30b-8k-chat/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..12e5d1b135d1d2db3ccee200345f14adf8d7a6b8
--- /dev/null
+++ b/evaluations/ar/jais-family-30b-8k-chat/openaimmlu_0_shot.json
@@ -0,0 +1,2653 @@
+{
+ "results": {
+ "openaimmlu": {
+ " ": " ",
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.42317880794701984,
+ "acc_stderr,none": 0.00879868850969859,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.32,
+ "acc_stderr,none": 0.046882617226215034
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.5460526315789473,
+ "acc_stderr,none": 0.04051646342874142
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.5277777777777778,
+ "acc_stderr,none": 0.04174752578923183
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.33,
+ "acc_stderr,none": 0.047258156262526045
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.37,
+ "acc_stderr,none": 0.048523658709391
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.25,
+ "acc_stderr,none": 0.04351941398892446
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.29411764705882354,
+ "acc_stderr,none": 0.04533838195929774
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.65,
+ "acc_stderr,none": 0.047937248544110196
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.42127659574468085,
+ "acc_stderr,none": 0.03227834510146267
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.3508771929824561,
+ "acc_stderr,none": 0.044895393502706986
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.5172413793103449,
+ "acc_stderr,none": 0.04164188720169375
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.38095238095238093,
+ "acc_stderr,none": 0.025010749116137602
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.6,
+ "acc_stderr,none": 0.027869320571664632
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.4482758620689655,
+ "acc_stderr,none": 0.03499113137676744
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.51,
+ "acc_stderr,none": 0.05024183937956913
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.3074074074074074,
+ "acc_stderr,none": 0.028133252578815646
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.3973509933774834,
+ "acc_stderr,none": 0.0399552400768168
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.3148148148148148,
+ "acc_stderr,none": 0.03167468706828979
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.6529933481152993,
+ "acc_stderr,none": 0.011015620283718329,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.7454545454545455,
+ "acc_stderr,none": 0.03401506715249039
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.6813725490196079,
+ "acc_stderr,none": 0.032702871814820796
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.7932489451476793,
+ "acc_stderr,none": 0.0263616516683891
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.6611570247933884,
+ "acc_stderr,none": 0.04320767807536671
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.5185185185185185,
+ "acc_stderr,none": 0.04830366024635331
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.6441717791411042,
+ "acc_stderr,none": 0.03761521380046734
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.617363344051447,
+ "acc_stderr,none": 0.027604689028581982
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.5185185185185185,
+ "acc_stderr,none": 0.02780165621232366
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.7426900584795322,
+ "acc_stderr,none": 0.03352799844161865
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.5089345920431557,
+ "acc_stderr,none": 0.006348375134748246,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.4444444444444444,
+ "acc_stderr,none": 0.04292596718256981
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.5169811320754717,
+ "acc_stderr,none": 0.030755120364119898
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.4393063583815029,
+ "acc_stderr,none": 0.037842719328874674
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.5158730158730159,
+ "acc_stderr,none": 0.044698818540726076
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.39,
+ "acc_stderr,none": 0.04902071300001974
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.03358618145732523
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.6422018348623854,
+ "acc_stderr,none": 0.02055206078482782
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.6143497757847534,
+ "acc_stderr,none": 0.03266842214289201
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.36607142857142855,
+ "acc_stderr,none": 0.04572372358737431
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.65,
+ "acc_stderr,none": 0.047937248544110196
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.6500638569604087,
+ "acc_stderr,none": 0.017055679797150433
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.5718954248366013,
+ "acc_stderr,none": 0.028332397483664278
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.3333333333333333,
+ "acc_stderr,none": 0.028121636040639882
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.4048239895697523,
+ "acc_stderr,none": 0.012536743830953984
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.44485294117647056,
+ "acc_stderr,none": 0.03018753206032938
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.4918300653594771,
+ "acc_stderr,none": 0.020225134343057255
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.5783132530120482,
+ "acc_stderr,none": 0.03844453181770917
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.5091296409007913,
+ "acc_stderr,none": 0.008080375838360021,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.7,
+ "acc_stderr,none": 0.046056618647183814
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.6683937823834197,
+ "acc_stderr,none": 0.03397636541089118
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.4205128205128205,
+ "acc_stderr,none": 0.025028610276710855
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.46218487394957986,
+ "acc_stderr,none": 0.032385469487589795
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.7251908396946565,
+ "acc_stderr,none": 0.039153454088478354
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.6407766990291263,
+ "acc_stderr,none": 0.047504583990416946
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.7393162393162394,
+ "acc_stderr,none": 0.028760348956523414
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.6416184971098265,
+ "acc_stderr,none": 0.025816756791584204
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.24134078212290502,
+ "acc_stderr,none": 0.014310999547961455
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.6727272727272727,
+ "acc_stderr,none": 0.04494290866252089
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.5836734693877551,
+ "acc_stderr,none": 0.031557828165561644
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.6616915422885572,
+ "acc_stderr,none": 0.033455630703391914
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.78,
+ "acc_stderr,none": 0.04163331998932261
+ }
+ },
+ "groups": {
+ "openaimmlu_STEM": {
+ "acc,none": 0.42317880794701984,
+ "acc_stderr,none": 0.00879868850969859,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.6529933481152993,
+ "acc_stderr,none": 0.011015620283718329,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.5089345920431557,
+ "acc_stderr,none": 0.006348375134748246,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.5091296409007913,
+ "acc_stderr,none": 0.008080375838360021,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_jurisprudence",
+ "openaimmlu_world_religions",
+ "openaimmlu_philosophy",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_international_law",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_prehistory"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_business_ethics",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_marketing",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_moral_disputes",
+ "openaimmlu_sociology",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_management",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_security_studies",
+ "openaimmlu_public_relations",
+ "openaimmlu_high_school_microeconomics"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_professional_medicine",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_virology",
+ "openaimmlu_anatomy",
+ "openaimmlu_formal_logic",
+ "openaimmlu_professional_law",
+ "openaimmlu_human_aging",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_college_medicine",
+ "openaimmlu_high_school_geography",
+ "openaimmlu_nutrition",
+ "openaimmlu_machine_learning",
+ "openaimmlu_global_facts",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_clinical_knowledge"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_computer_security",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_astronomy",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_college_biology",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_college_physics",
+ "openaimmlu_econometrics",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_high_school_statistics",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_high_school_computer_science"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=inceptionai/jais-family-30b-8k-chat,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 30208489464,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "dab185164dd3b79ec9201d7f4cf878ce91ae7e14",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1731336538.8729222,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] 
torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.38.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-30b-8k-chat",
+ "model_name_sanitized": "inceptionai__jais-family-30b-8k-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 123003.574879592,
+ "end_time": 128796.590605457,
+ "total_evaluation_time_seconds": "5793.015725865"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-6p7b-chat/acva_5_shot.json b/evaluations/ar/jais-family-6p7b-chat/acva_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..0d589bbf5d122ff4b8dd455d4c44e543697225ce
--- /dev/null
+++ b/evaluations/ar/jais-family-6p7b-chat/acva_5_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7362801377726751,
+ "acc_stderr,none": 0.004721813366850479,
+ "acc_norm,none": 0.7380022962112515,
+ "acc_norm_stderr,none": 0.004711871670802378
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 1.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-6p7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 6794562592,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "683805efe6126c6536feb4aa23317e70222ac94c",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737031815.1720507,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.9\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.3.107\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\nGPU 2: NVIDIA A100 80GB PCIe\nGPU 3: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (12 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.15.0rc2\n[pip3] open_clip_torch==2.26.1\n[pip3] optree==0.10.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.2.0a0\n[pip3] torchdata==0.7.0a0\n[pip3] 
torchdiffeq==0.2.4\n[pip3] torchmetrics==1.4.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.17.0a0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-6p7b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-6p7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 888109.152536122,
+ "end_time": 888872.198441387,
+ "total_evaluation_time_seconds": "763.0459052650258"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-6p7b-chat/ar_ifeval_0_shot.json b/evaluations/ar/jais-family-6p7b-chat/ar_ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a437ad3db9f704038036a11bdee009dafa063da0
--- /dev/null
+++ b/evaluations/ar/jais-family-6p7b-chat/ar_ifeval_0_shot.json
@@ -0,0 +1,142 @@
+{
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.13992537313432835,
+ "prompt_level_strict_acc_stderr,none": 0.01499820943129382,
+ "inst_level_strict_acc,none": 0.5296928327645051,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.19402985074626866,
+ "prompt_level_loose_acc_stderr,none": 0.017096879956145804,
+ "inst_level_loose_acc,none": 0.5829351535836177,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-6p7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 6794562592,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "683805efe6126c6536feb4aa23317e70222ac94c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739621726.7246006,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "ar_ifeval": "09fb0c6580f0a42624590f94c9483581a566f54a07cf60f59a60d159e4c054e2"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-6p7b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-6p7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 998103.97224687,
+ "end_time": 1001143.402077609,
+ "total_evaluation_time_seconds": "3039.4298307389254"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-6p7b-chat/araMath_v3_5_shot.json b/evaluations/ar/jais-family-6p7b-chat/araMath_v3_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..202f47f1a6227cc5f4c623187c969cde14473fcd
--- /dev/null
+++ b/evaluations/ar/jais-family-6p7b-chat/araMath_v3_5_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.2528925619834711,
+ "acc_stderr,none": 0.01768646703051157,
+ "acc_norm,none": 0.2528925619834711,
+ "acc_norm_stderr,none": 0.01768646703051157
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-6p7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 6794562592,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "683805efe6126c6536feb4aa23317e70222ac94c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739621599.63682,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "araMath_v3": "d0d66a51e36e6cb52cf906fef452bc518aad1a1e641c82f522dc8014f42cc48e"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-6p7b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-6p7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 997976.877897655,
+ "end_time": 998038.449346402,
+ "total_evaluation_time_seconds": "61.57144874695223"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-6p7b-chat/araPro_0_shot.json b/evaluations/ar/jais-family-6p7b-chat/araPro_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..e32bbd4c41f250172c69d882e5cf29db323b94b4
--- /dev/null
+++ b/evaluations/ar/jais-family-6p7b-chat/araPro_0_shot.json
@@ -0,0 +1,130 @@
+{
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.5430913817236552,
+ "acc_stderr,none": 0.0070447588009972875,
+ "acc_norm,none": 0.5430913817236552,
+ "acc_norm_stderr,none": 0.0070447588009972875
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-6p7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 6794562592,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "683805efe6126c6536feb4aa23317e70222ac94c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739619950.267259,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "araPro": "6801d81fb64458427c0b7638660f113d7777c17252b7552d3a623eccf14d861c"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-6p7b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-6p7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 996327.677962648,
+ "end_time": 997233.300882672,
+ "total_evaluation_time_seconds": "905.6229200239759"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-6p7b-chat/arabicmmlu_0_shot.json b/evaluations/ar/jais-family-6p7b-chat/arabicmmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..3d6517d4a3f2c4ddf9abc2451801478bd9e76817
--- /dev/null
+++ b/evaluations/ar/jais-family-6p7b-chat/arabicmmlu_0_shot.json
@@ -0,0 +1,2051 @@
+{
+ "results": {
+ "arabicmmlu": {
+ "acc,none": 0.5615358007609823,
+ "acc_stderr,none": 0.0040081744379782324,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5793825799338479,
+ "acc_stderr,none": 0.007845556182843596,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_high_history": {
+ "alias": " - High History",
+ "acc,none": 0.4644736842105263,
+ "acc_stderr,none": 0.018102980227879498
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "alias": " - High Islamic Studies",
+ "acc,none": 0.5568862275449101,
+ "acc_stderr,none": 0.02722191955486199
+ },
+ "arabicmmlu_high_philosophy": {
+ "alias": " - High Philosophy",
+ "acc,none": 0.5641025641025641,
+ "acc_stderr,none": 0.08044135838502685
+ },
+ "arabicmmlu_islamic_studies": {
+ "alias": " - Islamic Studies",
+ "acc,none": 0.5446009389671361,
+ "acc_stderr,none": 0.019716277358004537
+ },
+ "arabicmmlu_middle_history": {
+ "alias": " - Middle History",
+ "acc,none": 0.6305418719211823,
+ "acc_stderr,none": 0.03395970381998574
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "alias": " - Middle Islamic Studies",
+ "acc,none": 0.6764705882352942,
+ "acc_stderr,none": 0.030388353551886804
+ },
+ "arabicmmlu_primary_history": {
+ "alias": " - Primary History",
+ "acc,none": 0.6274509803921569,
+ "acc_stderr,none": 0.04810840148082633
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "alias": " - Primary Islamic Studies",
+ "acc,none": 0.7567567567567568,
+ "acc_stderr,none": 0.013581047734799375
+ },
+ "arabicmmlu_prof_law": {
+ "alias": " - Prof Law",
+ "acc,none": 0.267515923566879,
+ "acc_stderr,none": 0.02502083184496839
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5419198055893074,
+ "acc_stderr,none": 0.011963912297784807,
+ "alias": " - Language"
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "alias": " - Arabic Language (General)",
+ "acc,none": 0.6486928104575164,
+ "acc_stderr,none": 0.019312676065786558
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "alias": " - Arabic Language (Grammar)",
+ "acc,none": 0.4821917808219178,
+ "acc_stderr,none": 0.026190493374762456
+ },
+ "arabicmmlu_high_arabic_language": {
+ "alias": " - High Arabic Language",
+ "acc,none": 0.36923076923076925,
+ "acc_stderr,none": 0.02446861524147892
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "alias": " - Middle Arabic Language",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.09245003270420483
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "alias": " - Primary Arabic Language",
+ "acc,none": 0.623015873015873,
+ "acc_stderr,none": 0.03058963023693551
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6135265700483091,
+ "acc_stderr,none": 0.009769204350522023,
+ "alias": " - Other"
+ },
+ "arabicmmlu_driving_test": {
+ "alias": " - Driving Test",
+ "acc,none": 0.6193228736581338,
+ "acc_stderr,none": 0.01395867726280844
+ },
+ "arabicmmlu_general_knowledge": {
+ "alias": " - General Knowledge",
+ "acc,none": 0.5879629629629629,
+ "acc_stderr,none": 0.01675474084676195
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "alias": " - Middle General Knowledge",
+ "acc,none": 0.6337209302325582,
+ "acc_stderr,none": 0.03684317268101587
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "alias": " - Primary General Knowledge",
+ "acc,none": 0.6728395061728395,
+ "acc_stderr,none": 0.03697628122633146
+ },
+ "arabicmmlu_univ_management": {
+ "alias": " - Univ Management",
+ "acc,none": 0.64,
+ "acc_stderr,none": 0.05579886659703323
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.553082191780822,
+ "acc_stderr,none": 0.008233782175575884,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_high_civics": {
+ "alias": " - High Civics",
+ "acc,none": 0.4367816091954023,
+ "acc_stderr,none": 0.05348368965287097
+ },
+ "arabicmmlu_high_economics": {
+ "alias": " - High Economics",
+ "acc,none": 0.5416666666666666,
+ "acc_stderr,none": 0.026297202626624744
+ },
+ "arabicmmlu_high_geography": {
+ "alias": " - High Geography",
+ "acc,none": 0.4614643545279383,
+ "acc_stderr,none": 0.015480569337980291
+ },
+ "arabicmmlu_middle_civics": {
+ "alias": " - Middle Civics",
+ "acc,none": 0.4872881355932203,
+ "acc_stderr,none": 0.03260586088180842
+ },
+ "arabicmmlu_middle_economics": {
+ "alias": " - Middle Economics",
+ "acc,none": 0.7241379310344828,
+ "acc_stderr,none": 0.04819560289115228
+ },
+ "arabicmmlu_middle_geography": {
+ "alias": " - Middle Geography",
+ "acc,none": 0.6066176470588235,
+ "acc_stderr,none": 0.029674288281311155
+ },
+ "arabicmmlu_middle_social_science": {
+ "alias": " - Middle Social Science",
+ "acc,none": 0.5062240663900415,
+ "acc_stderr,none": 0.03227236052966302
+ },
+ "arabicmmlu_primary_geography": {
+ "alias": " - Primary Geography",
+ "acc,none": 0.631578947368421,
+ "acc_stderr,none": 0.06446025638903098
+ },
+ "arabicmmlu_primary_social_science": {
+ "alias": " - Primary Social Science",
+ "acc,none": 0.723404255319149,
+ "acc_stderr,none": 0.016858811203830114
+ },
+ "arabicmmlu_univ_accounting": {
+ "alias": " - Univ Accounting",
+ "acc,none": 0.4864864864864865,
+ "acc_stderr,none": 0.0584991962188687
+ },
+ "arabicmmlu_univ_economics": {
+ "alias": " - Univ Economics",
+ "acc,none": 0.48905109489051096,
+ "acc_stderr,none": 0.04286436555449051
+ },
+ "arabicmmlu_univ_political_science": {
+ "alias": " - Univ Political Science",
+ "acc,none": 0.5333333333333333,
+ "acc_stderr,none": 0.03450878044350498
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5202004384591293,
+ "acc_stderr,none": 0.008505739595068406,
+ "alias": " - STEM"
+ },
+ "arabicmmlu_high_biology": {
+ "alias": " - High Biology",
+ "acc,none": 0.42157558552164653,
+ "acc_stderr,none": 0.01316011566544646
+ },
+ "arabicmmlu_high_computer_science": {
+ "alias": " - High Computer Science",
+ "acc,none": 0.5325670498084292,
+ "acc_stderr,none": 0.030942837326193823
+ },
+ "arabicmmlu_high_physics": {
+ "alias": " - High Physics",
+ "acc,none": 0.3607843137254902,
+ "acc_stderr,none": 0.03013218860518198
+ },
+ "arabicmmlu_middle_computer_science": {
+ "alias": " - Middle Computer Science",
+ "acc,none": 0.8148148148148148,
+ "acc_stderr,none": 0.07618086585254093
+ },
+ "arabicmmlu_middle_natural_science": {
+ "alias": " - Middle Natural Science",
+ "acc,none": 0.7148760330578512,
+ "acc_stderr,none": 0.029081962470760236
+ },
+ "arabicmmlu_primary_computer_science": {
+ "alias": " - Primary Computer Science",
+ "acc,none": 0.6368421052631579,
+ "acc_stderr,none": 0.03498104083833201
+ },
+ "arabicmmlu_primary_math": {
+ "alias": " - Primary Math",
+ "acc,none": 0.5232273838630807,
+ "acc_stderr,none": 0.02472696435617918
+ },
+ "arabicmmlu_primary_natural_science": {
+ "alias": " - Primary Natural Science",
+ "acc,none": 0.8035714285714286,
+ "acc_stderr,none": 0.02170661827371784
+ },
+ "arabicmmlu_univ_computer_science": {
+ "alias": " - Univ Computer Science",
+ "acc,none": 0.5625,
+ "acc_stderr,none": 0.0625
+ }
+ },
+ "groups": {
+ "arabicmmlu": {
+ "acc,none": 0.5615358007609823,
+ "acc_stderr,none": 0.0040081744379782324,
+ "alias": "arabicmmlu"
+ },
+ "arabicmmlu_humanities": {
+ "acc,none": 0.5793825799338479,
+ "acc_stderr,none": 0.007845556182843596,
+ "alias": " - Humanities"
+ },
+ "arabicmmlu_language": {
+ "acc,none": 0.5419198055893074,
+ "acc_stderr,none": 0.011963912297784807,
+ "alias": " - Language"
+ },
+ "arabicmmlu_other": {
+ "acc,none": 0.6135265700483091,
+ "acc_stderr,none": 0.009769204350522023,
+ "alias": " - Other"
+ },
+ "arabicmmlu_social_science": {
+ "acc,none": 0.553082191780822,
+ "acc_stderr,none": 0.008233782175575884,
+ "alias": " - Social Science"
+ },
+ "arabicmmlu_stem": {
+ "acc,none": 0.5202004384591293,
+ "acc_stderr,none": 0.008505739595068406,
+ "alias": " - STEM"
+ }
+ },
+ "group_subtasks": {
+ "arabicmmlu_language": [
+ "arabicmmlu_high_arabic_language",
+ "arabicmmlu_arabic_language_(grammar)",
+ "arabicmmlu_arabic_language_(general)",
+ "arabicmmlu_middle_arabic_language",
+ "arabicmmlu_primary_arabic_language"
+ ],
+ "arabicmmlu_stem": [
+ "arabicmmlu_primary_natural_science",
+ "arabicmmlu_high_physics",
+ "arabicmmlu_primary_computer_science",
+ "arabicmmlu_primary_math",
+ "arabicmmlu_middle_computer_science",
+ "arabicmmlu_univ_computer_science",
+ "arabicmmlu_high_biology",
+ "arabicmmlu_high_computer_science",
+ "arabicmmlu_middle_natural_science"
+ ],
+ "arabicmmlu_humanities": [
+ "arabicmmlu_middle_history",
+ "arabicmmlu_primary_history",
+ "arabicmmlu_middle_islamic_studies",
+ "arabicmmlu_high_islamic_studies",
+ "arabicmmlu_prof_law",
+ "arabicmmlu_islamic_studies",
+ "arabicmmlu_primary_islamic_studies",
+ "arabicmmlu_high_history",
+ "arabicmmlu_high_philosophy"
+ ],
+ "arabicmmlu_social_science": [
+ "arabicmmlu_middle_civics",
+ "arabicmmlu_univ_economics",
+ "arabicmmlu_primary_geography",
+ "arabicmmlu_middle_geography",
+ "arabicmmlu_primary_social_science",
+ "arabicmmlu_middle_social_science",
+ "arabicmmlu_high_economics",
+ "arabicmmlu_high_civics",
+ "arabicmmlu_high_geography",
+ "arabicmmlu_middle_economics",
+ "arabicmmlu_univ_political_science",
+ "arabicmmlu_univ_accounting"
+ ],
+ "arabicmmlu_other": [
+ "arabicmmlu_univ_management",
+ "arabicmmlu_driving_test",
+ "arabicmmlu_primary_general_knowledge",
+ "arabicmmlu_general_knowledge",
+ "arabicmmlu_middle_general_knowledge"
+ ],
+ "arabicmmlu": [
+ "arabicmmlu_other",
+ "arabicmmlu_social_science",
+ "arabicmmlu_humanities",
+ "arabicmmlu_stem",
+ "arabicmmlu_language"
+ ]
+ },
+ "configs": {
+ "arabicmmlu_arabic_language_(general)": {
+ "task": "arabicmmlu_arabic_language_(general)",
+ "task_alias": "Arabic Language (General)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (General)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "task": "arabicmmlu_arabic_language_(grammar)",
+ "task_alias": "Arabic Language (Grammar)",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Arabic Language (Grammar)",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_driving_test": {
+ "task": "arabicmmlu_driving_test",
+ "task_alias": "Driving Test",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Driving Test",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_general_knowledge": {
+ "task": "arabicmmlu_general_knowledge",
+ "task_alias": "General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_arabic_language": {
+ "task": "arabicmmlu_high_arabic_language",
+ "task_alias": "High Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_biology": {
+ "task": "arabicmmlu_high_biology",
+ "task_alias": "High Biology",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Biology",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_civics": {
+ "task": "arabicmmlu_high_civics",
+ "task_alias": "High Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_computer_science": {
+ "task": "arabicmmlu_high_computer_science",
+ "task_alias": "High Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_economics": {
+ "task": "arabicmmlu_high_economics",
+ "task_alias": "High Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_geography": {
+ "task": "arabicmmlu_high_geography",
+ "task_alias": "High Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_history": {
+ "task": "arabicmmlu_high_history",
+ "task_alias": "High History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "task": "arabicmmlu_high_islamic_studies",
+ "task_alias": "High Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_philosophy": {
+ "task": "arabicmmlu_high_philosophy",
+ "task_alias": "High Philosophy",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Philosophy",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_high_physics": {
+ "task": "arabicmmlu_high_physics",
+ "task_alias": "High Physics",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "High Physics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_islamic_studies": {
+ "task": "arabicmmlu_islamic_studies",
+ "task_alias": "Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "task": "arabicmmlu_middle_arabic_language",
+ "task_alias": "Middle Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_civics": {
+ "task": "arabicmmlu_middle_civics",
+ "task_alias": "Middle Civics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Civics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_computer_science": {
+ "task": "arabicmmlu_middle_computer_science",
+ "task_alias": "Middle Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_economics": {
+ "task": "arabicmmlu_middle_economics",
+ "task_alias": "Middle Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "task": "arabicmmlu_middle_general_knowledge",
+ "task_alias": "Middle General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_geography": {
+ "task": "arabicmmlu_middle_geography",
+ "task_alias": "Middle Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_history": {
+ "task": "arabicmmlu_middle_history",
+ "task_alias": "Middle History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "task": "arabicmmlu_middle_islamic_studies",
+ "task_alias": "Middle Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_natural_science": {
+ "task": "arabicmmlu_middle_natural_science",
+ "task_alias": "Middle Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_middle_social_science": {
+ "task": "arabicmmlu_middle_social_science",
+ "task_alias": "Middle Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Middle Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "task": "arabicmmlu_primary_arabic_language",
+ "task_alias": "Primary Arabic Language",
+ "tag": "arabicmmlu_language_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Arabic Language",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_computer_science": {
+ "task": "arabicmmlu_primary_computer_science",
+ "task_alias": "Primary Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "task": "arabicmmlu_primary_general_knowledge",
+ "task_alias": "Primary General Knowledge",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary General Knowledge",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_geography": {
+ "task": "arabicmmlu_primary_geography",
+ "task_alias": "Primary Geography",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Geography",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_history": {
+ "task": "arabicmmlu_primary_history",
+ "task_alias": "Primary History",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary History",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "task": "arabicmmlu_primary_islamic_studies",
+ "task_alias": "Primary Islamic Studies",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Islamic Studies",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_math": {
+ "task": "arabicmmlu_primary_math",
+ "task_alias": "Primary Math",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Math",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_natural_science": {
+ "task": "arabicmmlu_primary_natural_science",
+ "task_alias": "Primary Natural Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Natural Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_primary_social_science": {
+ "task": "arabicmmlu_primary_social_science",
+ "task_alias": "Primary Social Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Primary Social Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_prof_law": {
+ "task": "arabicmmlu_prof_law",
+ "task_alias": "Prof Law",
+ "tag": "arabicmmlu_humanities_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Prof Law",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_accounting": {
+ "task": "arabicmmlu_univ_accounting",
+ "task_alias": "Univ Accounting",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Accounting",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_computer_science": {
+ "task": "arabicmmlu_univ_computer_science",
+ "task_alias": "Univ Computer Science",
+ "tag": "arabicmmlu_stem_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Computer Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_economics": {
+ "task": "arabicmmlu_univ_economics",
+ "task_alias": "Univ Economics",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Economics",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_management": {
+ "task": "arabicmmlu_univ_management",
+ "task_alias": "Univ Management",
+ "tag": "arabicmmlu_other_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Management",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "arabicmmlu_univ_political_science": {
+ "task": "arabicmmlu_univ_political_science",
+ "task_alias": "Univ Political Science",
+ "tag": "arabicmmlu_social_science_tasks",
+ "dataset_path": "yazeed7/ArabicMMLU",
+ "dataset_name": "Univ Political Science",
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "def doc_to_text(doc):\n \"\"\"\n Refactoring `prepare_data_en` to fit with the lm harness framework.\n https://github.com/mbzuai-nlp/ArabicMMLU/blob/main/util_prompt.py\n \"\"\"\n\n level = \"\" if not doc[\"Level\"] else \" for \" + level_en[doc[\"Level\"]]\n country = \"\" if not doc[\"Country\"] else \" in \" + doc[\"Country\"]\n main_meta_data = f\"{doc['Subject']} question{level}{country}\"\n\n question = (\n doc[\"Question\"]\n if doc[\"Context\"] == \"\"\n else f\"{doc['Context']}\\n\\n{doc['Question']}\"\n )\n\n options = []\n for i, opt in enumerate(\n [\"Option 1\", \"Option 2\", \"Option 3\", \"Option 4\", \"Option 5\"]\n ):\n if not doc[opt]:\n break\n options.append(f\"{alpa[i]} {doc[opt]}\")\n\n doc_text = PROMPT.format(main_meta_data, question, \"\\n\".join(options))\n\n return doc_text\n",
+ "doc_to_target": "Answer Key",
+ "doc_to_choice": "def doc_to_choice(doc):\n return [alpa[i][0] for i in range(5) if doc[f\"Option {i+1}\"]]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "arabicmmlu": 0,
+ "arabicmmlu_arabic_language_(general)": 0.0,
+ "arabicmmlu_arabic_language_(grammar)": 0.0,
+ "arabicmmlu_driving_test": 0.0,
+ "arabicmmlu_general_knowledge": 0.0,
+ "arabicmmlu_high_arabic_language": 0.0,
+ "arabicmmlu_high_biology": 0.0,
+ "arabicmmlu_high_civics": 0.0,
+ "arabicmmlu_high_computer_science": 0.0,
+ "arabicmmlu_high_economics": 0.0,
+ "arabicmmlu_high_geography": 0.0,
+ "arabicmmlu_high_history": 0.0,
+ "arabicmmlu_high_islamic_studies": 0.0,
+ "arabicmmlu_high_philosophy": 0.0,
+ "arabicmmlu_high_physics": 0.0,
+ "arabicmmlu_humanities": 0,
+ "arabicmmlu_islamic_studies": 0.0,
+ "arabicmmlu_language": 0,
+ "arabicmmlu_middle_arabic_language": 0.0,
+ "arabicmmlu_middle_civics": 0.0,
+ "arabicmmlu_middle_computer_science": 0.0,
+ "arabicmmlu_middle_economics": 0.0,
+ "arabicmmlu_middle_general_knowledge": 0.0,
+ "arabicmmlu_middle_geography": 0.0,
+ "arabicmmlu_middle_history": 0.0,
+ "arabicmmlu_middle_islamic_studies": 0.0,
+ "arabicmmlu_middle_natural_science": 0.0,
+ "arabicmmlu_middle_social_science": 0.0,
+ "arabicmmlu_other": 0,
+ "arabicmmlu_primary_arabic_language": 0.0,
+ "arabicmmlu_primary_computer_science": 0.0,
+ "arabicmmlu_primary_general_knowledge": 0.0,
+ "arabicmmlu_primary_geography": 0.0,
+ "arabicmmlu_primary_history": 0.0,
+ "arabicmmlu_primary_islamic_studies": 0.0,
+ "arabicmmlu_primary_math": 0.0,
+ "arabicmmlu_primary_natural_science": 0.0,
+ "arabicmmlu_primary_social_science": 0.0,
+ "arabicmmlu_prof_law": 0.0,
+ "arabicmmlu_social_science": 0,
+ "arabicmmlu_stem": 0,
+ "arabicmmlu_univ_accounting": 0.0,
+ "arabicmmlu_univ_computer_science": 0.0,
+ "arabicmmlu_univ_economics": 0.0,
+ "arabicmmlu_univ_management": 0.0,
+ "arabicmmlu_univ_political_science": 0.0
+ },
+ "n-shot": {
+ "arabicmmlu_arabic_language_(general)": 0,
+ "arabicmmlu_arabic_language_(grammar)": 0,
+ "arabicmmlu_driving_test": 0,
+ "arabicmmlu_general_knowledge": 0,
+ "arabicmmlu_high_arabic_language": 0,
+ "arabicmmlu_high_biology": 0,
+ "arabicmmlu_high_civics": 0,
+ "arabicmmlu_high_computer_science": 0,
+ "arabicmmlu_high_economics": 0,
+ "arabicmmlu_high_geography": 0,
+ "arabicmmlu_high_history": 0,
+ "arabicmmlu_high_islamic_studies": 0,
+ "arabicmmlu_high_philosophy": 0,
+ "arabicmmlu_high_physics": 0,
+ "arabicmmlu_islamic_studies": 0,
+ "arabicmmlu_middle_arabic_language": 0,
+ "arabicmmlu_middle_civics": 0,
+ "arabicmmlu_middle_computer_science": 0,
+ "arabicmmlu_middle_economics": 0,
+ "arabicmmlu_middle_general_knowledge": 0,
+ "arabicmmlu_middle_geography": 0,
+ "arabicmmlu_middle_history": 0,
+ "arabicmmlu_middle_islamic_studies": 0,
+ "arabicmmlu_middle_natural_science": 0,
+ "arabicmmlu_middle_social_science": 0,
+ "arabicmmlu_primary_arabic_language": 0,
+ "arabicmmlu_primary_computer_science": 0,
+ "arabicmmlu_primary_general_knowledge": 0,
+ "arabicmmlu_primary_geography": 0,
+ "arabicmmlu_primary_history": 0,
+ "arabicmmlu_primary_islamic_studies": 0,
+ "arabicmmlu_primary_math": 0,
+ "arabicmmlu_primary_natural_science": 0,
+ "arabicmmlu_primary_social_science": 0,
+ "arabicmmlu_prof_law": 0,
+ "arabicmmlu_univ_accounting": 0,
+ "arabicmmlu_univ_computer_science": 0,
+ "arabicmmlu_univ_economics": 0,
+ "arabicmmlu_univ_management": 0,
+ "arabicmmlu_univ_political_science": 0
+ },
+ "higher_is_better": {
+ "arabicmmlu": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "acc": true
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "acc": true
+ },
+ "arabicmmlu_driving_test": {
+ "acc": true
+ },
+ "arabicmmlu_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_high_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_high_biology": {
+ "acc": true
+ },
+ "arabicmmlu_high_civics": {
+ "acc": true
+ },
+ "arabicmmlu_high_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_high_economics": {
+ "acc": true
+ },
+ "arabicmmlu_high_geography": {
+ "acc": true
+ },
+ "arabicmmlu_high_history": {
+ "acc": true
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_high_philosophy": {
+ "acc": true
+ },
+ "arabicmmlu_high_physics": {
+ "acc": true
+ },
+ "arabicmmlu_humanities": {
+ "acc": true
+ },
+ "arabicmmlu_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_middle_civics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_economics": {
+ "acc": true
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_middle_geography": {
+ "acc": true
+ },
+ "arabicmmlu_middle_history": {
+ "acc": true
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_middle_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_middle_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_other": {
+ "acc": true
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "acc": true
+ },
+ "arabicmmlu_primary_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "acc": true
+ },
+ "arabicmmlu_primary_geography": {
+ "acc": true
+ },
+ "arabicmmlu_primary_history": {
+ "acc": true
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "acc": true
+ },
+ "arabicmmlu_primary_math": {
+ "acc": true
+ },
+ "arabicmmlu_primary_natural_science": {
+ "acc": true
+ },
+ "arabicmmlu_primary_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_prof_law": {
+ "acc": true
+ },
+ "arabicmmlu_social_science": {
+ "acc": true
+ },
+ "arabicmmlu_stem": {
+ "acc": true
+ },
+ "arabicmmlu_univ_accounting": {
+ "acc": true
+ },
+ "arabicmmlu_univ_computer_science": {
+ "acc": true
+ },
+ "arabicmmlu_univ_economics": {
+ "acc": true
+ },
+ "arabicmmlu_univ_management": {
+ "acc": true
+ },
+ "arabicmmlu_univ_political_science": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "arabicmmlu_univ_management": {
+ "original": 75,
+ "effective": 75
+ },
+ "arabicmmlu_driving_test": {
+ "original": 1211,
+ "effective": 1211
+ },
+ "arabicmmlu_primary_general_knowledge": {
+ "original": 162,
+ "effective": 162
+ },
+ "arabicmmlu_general_knowledge": {
+ "original": 864,
+ "effective": 864
+ },
+ "arabicmmlu_middle_general_knowledge": {
+ "original": 172,
+ "effective": 172
+ },
+ "arabicmmlu_middle_civics": {
+ "original": 236,
+ "effective": 236
+ },
+ "arabicmmlu_univ_economics": {
+ "original": 137,
+ "effective": 137
+ },
+ "arabicmmlu_primary_geography": {
+ "original": 57,
+ "effective": 57
+ },
+ "arabicmmlu_middle_geography": {
+ "original": 272,
+ "effective": 272
+ },
+ "arabicmmlu_primary_social_science": {
+ "original": 705,
+ "effective": 705
+ },
+ "arabicmmlu_middle_social_science": {
+ "original": 241,
+ "effective": 241
+ },
+ "arabicmmlu_high_economics": {
+ "original": 360,
+ "effective": 360
+ },
+ "arabicmmlu_high_civics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_high_geography": {
+ "original": 1038,
+ "effective": 1038
+ },
+ "arabicmmlu_middle_economics": {
+ "original": 87,
+ "effective": 87
+ },
+ "arabicmmlu_univ_political_science": {
+ "original": 210,
+ "effective": 210
+ },
+ "arabicmmlu_univ_accounting": {
+ "original": 74,
+ "effective": 74
+ },
+ "arabicmmlu_middle_history": {
+ "original": 203,
+ "effective": 203
+ },
+ "arabicmmlu_primary_history": {
+ "original": 102,
+ "effective": 102
+ },
+ "arabicmmlu_middle_islamic_studies": {
+ "original": 238,
+ "effective": 238
+ },
+ "arabicmmlu_high_islamic_studies": {
+ "original": 334,
+ "effective": 334
+ },
+ "arabicmmlu_prof_law": {
+ "original": 314,
+ "effective": 314
+ },
+ "arabicmmlu_islamic_studies": {
+ "original": 639,
+ "effective": 639
+ },
+ "arabicmmlu_primary_islamic_studies": {
+ "original": 999,
+ "effective": 999
+ },
+ "arabicmmlu_high_history": {
+ "original": 760,
+ "effective": 760
+ },
+ "arabicmmlu_high_philosophy": {
+ "original": 39,
+ "effective": 39
+ },
+ "arabicmmlu_primary_natural_science": {
+ "original": 336,
+ "effective": 336
+ },
+ "arabicmmlu_high_physics": {
+ "original": 255,
+ "effective": 255
+ },
+ "arabicmmlu_primary_computer_science": {
+ "original": 190,
+ "effective": 190
+ },
+ "arabicmmlu_primary_math": {
+ "original": 409,
+ "effective": 409
+ },
+ "arabicmmlu_middle_computer_science": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_univ_computer_science": {
+ "original": 64,
+ "effective": 64
+ },
+ "arabicmmlu_high_biology": {
+ "original": 1409,
+ "effective": 1409
+ },
+ "arabicmmlu_high_computer_science": {
+ "original": 261,
+ "effective": 261
+ },
+ "arabicmmlu_middle_natural_science": {
+ "original": 242,
+ "effective": 242
+ },
+ "arabicmmlu_high_arabic_language": {
+ "original": 390,
+ "effective": 390
+ },
+ "arabicmmlu_arabic_language_(grammar)": {
+ "original": 365,
+ "effective": 365
+ },
+ "arabicmmlu_arabic_language_(general)": {
+ "original": 612,
+ "effective": 612
+ },
+ "arabicmmlu_middle_arabic_language": {
+ "original": 27,
+ "effective": 27
+ },
+ "arabicmmlu_primary_arabic_language": {
+ "original": 252,
+ "effective": 252
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-6p7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 6794562592,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "683805efe6126c6536feb4aa23317e70222ac94c",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 8
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737024933.7295105,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.86\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-6p7b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-6p7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4246.203013659,
+ "end_time": 4515.04704094,
+ "total_evaluation_time_seconds": "268.8440272810003"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-6p7b-chat/etec_v2_0_shot.json b/evaluations/ar/jais-family-6p7b-chat/etec_v2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..827d625f2774b5abcacd76184ea0bcca7efd45bd
--- /dev/null
+++ b/evaluations/ar/jais-family-6p7b-chat/etec_v2_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.4546899841017488,
+ "acc_stderr,none": 0.011465911542349052,
+ "acc_norm,none": 0.4546899841017488,
+ "acc_norm_stderr,none": 0.011465911542349052
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-6p7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 6794562592,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "683805efe6126c6536feb4aa23317e70222ac94c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739620923.1960719,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "etec_v2": "d74045de4716b9652a4bfefbbb9f15b8700f98c226ac24538bb01ca5e0c7c2b2"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-6p7b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-6p7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 997300.665842326,
+ "end_time": 997374.082195903,
+ "total_evaluation_time_seconds": "73.41635357704945"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-6p7b-chat/exams_ar_5_shot.json b/evaluations/ar/jais-family-6p7b-chat/exams_ar_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..af389d0cccc5bbd3420595f7fadc9673fd0c2a47
--- /dev/null
+++ b/evaluations/ar/jais-family-6p7b-chat/exams_ar_5_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.4692737430167598,
+ "acc_stderr,none": 0.021555893034147955,
+ "acc_norm,none": 0.4692737430167598,
+ "acc_norm_stderr,none": 0.021555893034147955
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n \n \u0633\u0624\u0627\u0644:\n A. \n B. \n C. \n D. \n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-6p7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 6794562592,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "683805efe6126c6536feb4aa23317e70222ac94c",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 32
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737023749.692324,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.86\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-6p7b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-6p7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 3062.094354052,
+ "end_time": 3798.137119034,
+ "total_evaluation_time_seconds": "736.0427649819999"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-6p7b-chat/gat_0_shot.json b/evaluations/ar/jais-family-6p7b-chat/gat_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..803cad38126d30ac272d9c27705bfe22d4d924c2
--- /dev/null
+++ b/evaluations/ar/jais-family-6p7b-chat/gat_0_shot.json
@@ -0,0 +1,543 @@
+{
+ "results": {
+ "gat": {
+ "acc,none": 0.3171328232785652,
+ "acc_stderr,none": 0.003637711553191521,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.27606679035250464,
+ "acc_stderr,none": 0.008613061282358605
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.28123861566484515,
+ "acc_stderr,none": 0.008582973872557074
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.2465955097534045,
+ "acc_stderr,none": 0.008270691113113376
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.40095693779904307,
+ "acc_stderr,none": 0.015167976191724952
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.28524590163934427,
+ "acc_stderr,none": 0.01293260999733446
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.4049586776859504,
+ "acc_stderr,none": 0.014117759116052656
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.2691717791411043,
+ "acc_stderr,none": 0.012287123099249574
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.2219178082191781,
+ "acc_stderr,none": 0.021780012425347273
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.44688090737240077,
+ "acc_stderr,none": 0.009668842804567196
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.3171328232785652,
+ "acc_stderr,none": 0.003637711553191521,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=False,pretrained=inceptionai/jais-family-6p7b-chat,trust_remote_code=True,mm=False",
+ "model_num_parameters": 6794562592,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "683805efe6126c6536feb4aa23317e70222ac94c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1731226939.498854,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.31.0",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-6p7b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-6p7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 995.895425189,
+ "end_time": 2393.445262439,
+ "total_evaluation_time_seconds": "1397.54983725"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-6p7b-chat/moe_ien_mcq_0_shot.json b/evaluations/ar/jais-family-6p7b-chat/moe_ien_mcq_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..8e093d4c460b17e3ed289c7139f3a2df6bb3516f
--- /dev/null
+++ b/evaluations/ar/jais-family-6p7b-chat/moe_ien_mcq_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.46216216216216216,
+ "acc_stderr,none": 0.004988406802321253,
+ "acc_norm,none": 0.46216216216216216,
+ "acc_norm_stderr,none": 0.004988406802321253
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-6p7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 6794562592,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "683805efe6126c6536feb4aa23317e70222ac94c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739621060.9694111,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "moe_ien_mcq": "10880f503e175cc1732ea242e62a05f551ab3037c2343137caef8ccae9b636d6"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-6p7b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-6p7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 997438.313669996,
+ "end_time": 997692.239157761,
+ "total_evaluation_time_seconds": "253.9254877649946"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-6p7b-chat/moe_ien_tf_0_shot.json b/evaluations/ar/jais-family-6p7b-chat/moe_ien_tf_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..099685aa245bb97a06c0ad1eeab90bdc81dad05f
--- /dev/null
+++ b/evaluations/ar/jais-family-6p7b-chat/moe_ien_tf_0_shot.json
@@ -0,0 +1,129 @@
+{
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.6391894212605186,
+ "acc_stderr,none": 0.006293877994343678,
+ "acc_norm,none": 0.6391894212605186,
+ "acc_norm_stderr,none": 0.006293877994343678
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=inceptionai/jais-family-6p7b-chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 6794562592,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "683805efe6126c6536feb4aa23317e70222ac94c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739621379.8586364,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {
+ "moe_ien_tf": "944b34dde7f12f68b21e22312c06a9cdc68419df98db10d8e947f07ff8680ed0"
+ },
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-6p7b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-6p7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = '### Instruction: ' + messages[0]['content'] + '\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input:'%}{% else %}{% set loop_messages = messages %}{% set system_message = '### Instruction: Your name is \\'Jais\\', and you are named after Jebel Jais, the highest mountain in UAE. You were made by \\'Inception\\' in the UAE. You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Complete the conversation below between [|Human|] and [|AI|]:\n### Input:' %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = system_message %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{% if loop.index0 == 0 %}{{ content + ' [|Human|] ' + message['content'] }}{% else %}{{ '\n[|Human|] ' + content.strip() }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '\n[|AI|] ' + content.strip() }}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} {{'\n[|AI|]\n### Response:'}}{% endif %}",
+ "chat_template_sha": "83450a8b1d37090d808e836876679b8a0580f207e268605c01a54c91aac5346a",
+ "start_time": 997757.275772519,
+ "end_time": 997907.474074339,
+ "total_evaluation_time_seconds": "150.19830182008445"
+}
\ No newline at end of file
diff --git a/evaluations/ar/jais-family-6p7b-chat/openaimmlu_0_shot.json b/evaluations/ar/jais-family-6p7b-chat/openaimmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ffd2a41d57e9fe31bf2a3acaf86219e64ea1aaa4
--- /dev/null
+++ b/evaluations/ar/jais-family-6p7b-chat/openaimmlu_0_shot.json
@@ -0,0 +1,2653 @@
+{
+ "results": {
+ "openaimmlu": {
+ " ": " ",
+ "alias": "openaimmlu"
+ },
+ "openaimmlu_STEM": {
+ "acc,none": 0.371523178807947,
+ "acc_stderr,none": 0.008656573685910865,
+ "alias": " - STEM"
+ },
+ "openaimmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.28,
+ "acc_stderr,none": 0.04512608598542127
+ },
+ "openaimmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.506578947368421,
+ "acc_stderr,none": 0.040685900502249704
+ },
+ "openaimmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.4513888888888889,
+ "acc_stderr,none": 0.04161402398403279
+ },
+ "openaimmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.33,
+ "acc_stderr,none": 0.04725815626252605
+ },
+ "openaimmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.34,
+ "acc_stderr,none": 0.04760952285695235
+ },
+ "openaimmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.27,
+ "acc_stderr,none": 0.04461960433384739
+ },
+ "openaimmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.3235294117647059,
+ "acc_stderr,none": 0.046550104113196177
+ },
+ "openaimmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.6,
+ "acc_stderr,none": 0.04923659639173309
+ },
+ "openaimmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.3021276595744681,
+ "acc_stderr,none": 0.030017554471880554
+ },
+ "openaimmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.3157894736842105,
+ "acc_stderr,none": 0.04372748290278007
+ },
+ "openaimmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.4689655172413793,
+ "acc_stderr,none": 0.04158632762097828
+ },
+ "openaimmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.30158730158730157,
+ "acc_stderr,none": 0.0236369759961018
+ },
+ "openaimmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.5129032258064516,
+ "acc_stderr,none": 0.028434533152681855
+ },
+ "openaimmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.37438423645320196,
+ "acc_stderr,none": 0.03405155380561952
+ },
+ "openaimmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.46,
+ "acc_stderr,none": 0.05009082659620333
+ },
+ "openaimmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.3037037037037037,
+ "acc_stderr,none": 0.028037929969114996
+ },
+ "openaimmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.2913907284768212,
+ "acc_stderr,none": 0.03710185726119995
+ },
+ "openaimmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.3194444444444444,
+ "acc_stderr,none": 0.031798763421768524
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.5670731707317073,
+ "acc_stderr,none": 0.011571149652502576,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.6181818181818182,
+ "acc_stderr,none": 0.03793713171165633
+ },
+ "openaimmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.5686274509803921,
+ "acc_stderr,none": 0.03476099060501637
+ },
+ "openaimmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.6624472573839663,
+ "acc_stderr,none": 0.03078154910202622
+ },
+ "openaimmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.5785123966942148,
+ "acc_stderr,none": 0.04507732278775087
+ },
+ "openaimmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.5185185185185185,
+ "acc_stderr,none": 0.04830366024635331
+ },
+ "openaimmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.5705521472392638,
+ "acc_stderr,none": 0.038890666191127236
+ },
+ "openaimmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.5594855305466238,
+ "acc_stderr,none": 0.02819640057419743
+ },
+ "openaimmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.44135802469135804,
+ "acc_stderr,none": 0.027628737155668773
+ },
+ "openaimmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.6549707602339181,
+ "acc_stderr,none": 0.03645981377388807
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.4541469993256912,
+ "acc_stderr,none": 0.00637312825963741,
+ "alias": " - Other"
+ },
+ "openaimmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.4148148148148148,
+ "acc_stderr,none": 0.04256193767901407
+ },
+ "openaimmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.5056603773584906,
+ "acc_stderr,none": 0.030770900763851295
+ },
+ "openaimmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.45664739884393063,
+ "acc_stderr,none": 0.03798106566014498
+ },
+ "openaimmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.373015873015873,
+ "acc_stderr,none": 0.04325506042017086
+ },
+ "openaimmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.27,
+ "acc_stderr,none": 0.044619604333847394
+ },
+ "openaimmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.6363636363636364,
+ "acc_stderr,none": 0.03427308652999934
+ },
+ "openaimmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.5669724770642202,
+ "acc_stderr,none": 0.021244146569074345
+ },
+ "openaimmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.5381165919282511,
+ "acc_stderr,none": 0.03346015011973228
+ },
+ "openaimmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.38392857142857145,
+ "acc_stderr,none": 0.04616143075028547
+ },
+ "openaimmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.49,
+ "acc_stderr,none": 0.05024183937956912
+ },
+ "openaimmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.5491698595146871,
+ "acc_stderr,none": 0.01779329757269903
+ },
+ "openaimmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.5261437908496732,
+ "acc_stderr,none": 0.028590752958852394
+ },
+ "openaimmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.3546099290780142,
+ "acc_stderr,none": 0.02853865002887864
+ },
+ "openaimmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.363754889178618,
+ "acc_stderr,none": 0.012286991879902879
+ },
+ "openaimmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.4264705882352941,
+ "acc_stderr,none": 0.030042615832714878
+ },
+ "openaimmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.43300653594771243,
+ "acc_stderr,none": 0.020045442473324227
+ },
+ "openaimmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.4457831325301205,
+ "acc_stderr,none": 0.03869543323472101
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.4485696895922094,
+ "acc_stderr,none": 0.00825811528889283,
+ "alias": " - Social Science"
+ },
+ "openaimmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.58,
+ "acc_stderr,none": 0.049604496374885836
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.6062176165803109,
+ "acc_stderr,none": 0.035260770955482405
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.3923076923076923,
+ "acc_stderr,none": 0.02475600038213095
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.38235294117647056,
+ "acc_stderr,none": 0.03156663099215416
+ },
+ "openaimmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.5648854961832062,
+ "acc_stderr,none": 0.04348208051644858
+ },
+ "openaimmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.6310679611650486,
+ "acc_stderr,none": 0.0477761518115674
+ },
+ "openaimmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.6752136752136753,
+ "acc_stderr,none": 0.03067902276549883
+ },
+ "openaimmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.4682080924855491,
+ "acc_stderr,none": 0.026864624366756646
+ },
+ "openaimmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.23687150837988827,
+ "acc_stderr,none": 0.014219570788103986
+ },
+ "openaimmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.5181818181818182,
+ "acc_stderr,none": 0.04785964010794916
+ },
+ "openaimmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.5877551020408164,
+ "acc_stderr,none": 0.03151236044674268
+ },
+ "openaimmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.5920398009950248,
+ "acc_stderr,none": 0.03475116365194092
+ },
+ "openaimmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.64,
+ "acc_stderr,none": 0.04824181513244218
+ }
+ },
+ "groups": {
+ "openaimmlu_STEM": {
+ "acc,none": 0.371523178807947,
+ "acc_stderr,none": 0.008656573685910865,
+ "alias": " - STEM"
+ },
+ "openaimmlu_humanities": {
+ "acc,none": 0.5670731707317073,
+ "acc_stderr,none": 0.011571149652502576,
+ "alias": " - Humanities"
+ },
+ "openaimmlu_other": {
+ "acc,none": 0.4541469993256912,
+ "acc_stderr,none": 0.00637312825963741,
+ "alias": " - Other"
+ },
+ "openaimmlu_social_science": {
+ "acc,none": 0.4485696895922094,
+ "acc_stderr,none": 0.00825811528889283,
+ "alias": " - Social Science"
+ }
+ },
+ "group_subtasks": {
+ "openaimmlu_humanities": [
+ "openaimmlu_jurisprudence",
+ "openaimmlu_high_school_world_history",
+ "openaimmlu_logical_fallacies",
+ "openaimmlu_high_school_european_history",
+ "openaimmlu_philosophy",
+ "openaimmlu_international_law",
+ "openaimmlu_world_religions",
+ "openaimmlu_high_school_us_history",
+ "openaimmlu_prehistory"
+ ],
+ "openaimmlu_social_science": [
+ "openaimmlu_moral_disputes",
+ "openaimmlu_marketing",
+ "openaimmlu_security_studies",
+ "openaimmlu_management",
+ "openaimmlu_business_ethics",
+ "openaimmlu_moral_scenarios",
+ "openaimmlu_human_sexuality",
+ "openaimmlu_high_school_macroeconomics",
+ "openaimmlu_high_school_government_and_politics",
+ "openaimmlu_public_relations",
+ "openaimmlu_us_foreign_policy",
+ "openaimmlu_high_school_microeconomics",
+ "openaimmlu_sociology"
+ ],
+ "openaimmlu_other": [
+ "openaimmlu_formal_logic",
+ "openaimmlu_clinical_knowledge",
+ "openaimmlu_college_medicine",
+ "openaimmlu_professional_law",
+ "openaimmlu_anatomy",
+ "openaimmlu_nutrition",
+ "openaimmlu_human_aging",
+ "openaimmlu_professional_accounting",
+ "openaimmlu_professional_medicine",
+ "openaimmlu_machine_learning",
+ "openaimmlu_global_facts",
+ "openaimmlu_miscellaneous",
+ "openaimmlu_medical_genetics",
+ "openaimmlu_virology",
+ "openaimmlu_professional_psychology",
+ "openaimmlu_high_school_psychology",
+ "openaimmlu_high_school_geography"
+ ],
+ "openaimmlu_STEM": [
+ "openaimmlu_college_physics",
+ "openaimmlu_college_computer_science",
+ "openaimmlu_college_chemistry",
+ "openaimmlu_high_school_chemistry",
+ "openaimmlu_econometrics",
+ "openaimmlu_high_school_mathematics",
+ "openaimmlu_high_school_computer_science",
+ "openaimmlu_computer_security",
+ "openaimmlu_college_biology",
+ "openaimmlu_conceptual_physics",
+ "openaimmlu_high_school_biology",
+ "openaimmlu_electrical_engineering",
+ "openaimmlu_elementary_mathematics",
+ "openaimmlu_college_mathematics",
+ "openaimmlu_astronomy",
+ "openaimmlu_abstract_algebra",
+ "openaimmlu_high_school_physics",
+ "openaimmlu_high_school_statistics"
+ ],
+ "openaimmlu": [
+ "openaimmlu_STEM",
+ "openaimmlu_other",
+ "openaimmlu_social_science",
+ "openaimmlu_humanities"
+ ]
+ },
+ "configs": {
+ "openaimmlu_abstract_algebra": {
+ "task": "openaimmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "abstract_algebra",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_anatomy": {
+ "task": "openaimmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "anatomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_astronomy": {
+ "task": "openaimmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "astronomy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_business_ethics": {
+ "task": "openaimmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "business_ethics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_clinical_knowledge": {
+ "task": "openaimmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "clinical_knowledge",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_biology": {
+ "task": "openaimmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_chemistry": {
+ "task": "openaimmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_computer_science": {
+ "task": "openaimmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_mathematics": {
+ "task": "openaimmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_medicine": {
+ "task": "openaimmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_college_physics": {
+ "task": "openaimmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "college_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_computer_security": {
+ "task": "openaimmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "computer_security",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_conceptual_physics": {
+ "task": "openaimmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "conceptual_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_econometrics": {
+ "task": "openaimmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "econometrics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_electrical_engineering": {
+ "task": "openaimmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "electrical_engineering",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_elementary_mathematics": {
+ "task": "openaimmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "elementary_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_formal_logic": {
+ "task": "openaimmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "formal_logic",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_global_facts": {
+ "task": "openaimmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "global_facts",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_biology": {
+ "task": "openaimmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_biology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_chemistry": {
+ "task": "openaimmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_chemistry",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_computer_science": {
+ "task": "openaimmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_computer_science",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_european_history": {
+ "task": "openaimmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_european_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_geography": {
+ "task": "openaimmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_geography",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "task": "openaimmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_government_and_politics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "task": "openaimmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_macroeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_mathematics": {
+ "task": "openaimmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_mathematics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "task": "openaimmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_microeconomics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_physics": {
+ "task": "openaimmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_physics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_psychology": {
+ "task": "openaimmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_statistics": {
+ "task": "openaimmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "openaimmlu_STEM_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_statistics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_us_history": {
+ "task": "openaimmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_us_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_high_school_world_history": {
+ "task": "openaimmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "high_school_world_history",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_aging": {
+ "task": "openaimmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_aging",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_human_sexuality": {
+ "task": "openaimmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "human_sexuality",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_international_law": {
+ "task": "openaimmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "international_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_jurisprudence": {
+ "task": "openaimmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "jurisprudence",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_logical_fallacies": {
+ "task": "openaimmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "logical_fallacies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_machine_learning": {
+ "task": "openaimmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "machine_learning",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_management": {
+ "task": "openaimmlu_management",
+ "task_alias": "management",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "management",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_marketing": {
+ "task": "openaimmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "marketing",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_medical_genetics": {
+ "task": "openaimmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "medical_genetics",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_miscellaneous": {
+ "task": "openaimmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "miscellaneous",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_disputes": {
+ "task": "openaimmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_disputes",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_moral_scenarios": {
+ "task": "openaimmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "moral_scenarios",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_nutrition": {
+ "task": "openaimmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "nutrition",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_philosophy": {
+ "task": "openaimmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "philosophy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_prehistory": {
+ "task": "openaimmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "prehistory",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_accounting": {
+ "task": "openaimmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_accounting",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_law": {
+ "task": "openaimmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_law",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_medicine": {
+ "task": "openaimmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_medicine",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_professional_psychology": {
+ "task": "openaimmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "professional_psychology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_public_relations": {
+ "task": "openaimmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "public_relations",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_security_studies": {
+ "task": "openaimmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "security_studies",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_sociology": {
+ "task": "openaimmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "sociology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_us_foreign_policy": {
+ "task": "openaimmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "openaimmlu_social_science_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "us_foreign_policy",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_virology": {
+ "task": "openaimmlu_virology",
+ "task_alias": "virology",
+ "tag": "openaimmlu_other_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "virology",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "openaimmlu_world_religions": {
+ "task": "openaimmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "openaimmlu_humanities_tasks",
+ "dataset_path": "khalidalt/openai_mmlu_arabic",
+ "dataset_name": "world_religions",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n\n def format_example(doc, choices):\n options = []\n for _, choice in enumerate(choices):\n options.append(f'{en2ar[choice]}. {doc[choice]}')\n\n ar_subject = SUBJECTS[doc['Subject']]\n query = PROMPT.format(ar_subject, #doc['Subject'],\n doc['Question'],\n \"\\n\".join(options))\n return query\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n keys_ar = ['\u0623', '\u0628', '\u062c', '\u062f']\n ar_label = en2ar[doc['Answer']]\n\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_ar,\n \"gold\": keys_ar.index(ar_label)\n }\n\n return out_doc\n\n return dataset.map(_process_docs) \n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "openaimmlu_STEM": 0,
+ "openaimmlu_abstract_algebra": 0.0,
+ "openaimmlu_anatomy": 0.0,
+ "openaimmlu_astronomy": 0.0,
+ "openaimmlu_business_ethics": 0.0,
+ "openaimmlu_clinical_knowledge": 0.0,
+ "openaimmlu_college_biology": 0.0,
+ "openaimmlu_college_chemistry": 0.0,
+ "openaimmlu_college_computer_science": 0.0,
+ "openaimmlu_college_mathematics": 0.0,
+ "openaimmlu_college_medicine": 0.0,
+ "openaimmlu_college_physics": 0.0,
+ "openaimmlu_computer_security": 0.0,
+ "openaimmlu_conceptual_physics": 0.0,
+ "openaimmlu_econometrics": 0.0,
+ "openaimmlu_electrical_engineering": 0.0,
+ "openaimmlu_elementary_mathematics": 0.0,
+ "openaimmlu_formal_logic": 0.0,
+ "openaimmlu_global_facts": 0.0,
+ "openaimmlu_high_school_biology": 0.0,
+ "openaimmlu_high_school_chemistry": 0.0,
+ "openaimmlu_high_school_computer_science": 0.0,
+ "openaimmlu_high_school_european_history": 0.0,
+ "openaimmlu_high_school_geography": 0.0,
+ "openaimmlu_high_school_government_and_politics": 0.0,
+ "openaimmlu_high_school_macroeconomics": 0.0,
+ "openaimmlu_high_school_mathematics": 0.0,
+ "openaimmlu_high_school_microeconomics": 0.0,
+ "openaimmlu_high_school_physics": 0.0,
+ "openaimmlu_high_school_psychology": 0.0,
+ "openaimmlu_high_school_statistics": 0.0,
+ "openaimmlu_high_school_us_history": 0.0,
+ "openaimmlu_high_school_world_history": 0.0,
+ "openaimmlu_human_aging": 0.0,
+ "openaimmlu_human_sexuality": 0.0,
+ "openaimmlu_humanities": 0,
+ "openaimmlu_international_law": 0.0,
+ "openaimmlu_jurisprudence": 0.0,
+ "openaimmlu_logical_fallacies": 0.0,
+ "openaimmlu_machine_learning": 0.0,
+ "openaimmlu_management": 0.0,
+ "openaimmlu_marketing": 0.0,
+ "openaimmlu_medical_genetics": 0.0,
+ "openaimmlu_miscellaneous": 0.0,
+ "openaimmlu_moral_disputes": 0.0,
+ "openaimmlu_moral_scenarios": 0.0,
+ "openaimmlu_nutrition": 0.0,
+ "openaimmlu_other": 0,
+ "openaimmlu_philosophy": 0.0,
+ "openaimmlu_prehistory": 0.0,
+ "openaimmlu_professional_accounting": 0.0,
+ "openaimmlu_professional_law": 0.0,
+ "openaimmlu_professional_medicine": 0.0,
+ "openaimmlu_professional_psychology": 0.0,
+ "openaimmlu_public_relations": 0.0,
+ "openaimmlu_security_studies": 0.0,
+ "openaimmlu_social_science": 0,
+ "openaimmlu_sociology": 0.0,
+ "openaimmlu_us_foreign_policy": 0.0,
+ "openaimmlu_virology": 0.0,
+ "openaimmlu_world_religions": 0.0
+ },
+ "n-shot": {
+ "openaimmlu_abstract_algebra": 0,
+ "openaimmlu_anatomy": 0,
+ "openaimmlu_astronomy": 0,
+ "openaimmlu_business_ethics": 0,
+ "openaimmlu_clinical_knowledge": 0,
+ "openaimmlu_college_biology": 0,
+ "openaimmlu_college_chemistry": 0,
+ "openaimmlu_college_computer_science": 0,
+ "openaimmlu_college_mathematics": 0,
+ "openaimmlu_college_medicine": 0,
+ "openaimmlu_college_physics": 0,
+ "openaimmlu_computer_security": 0,
+ "openaimmlu_conceptual_physics": 0,
+ "openaimmlu_econometrics": 0,
+ "openaimmlu_electrical_engineering": 0,
+ "openaimmlu_elementary_mathematics": 0,
+ "openaimmlu_formal_logic": 0,
+ "openaimmlu_global_facts": 0,
+ "openaimmlu_high_school_biology": 0,
+ "openaimmlu_high_school_chemistry": 0,
+ "openaimmlu_high_school_computer_science": 0,
+ "openaimmlu_high_school_european_history": 0,
+ "openaimmlu_high_school_geography": 0,
+ "openaimmlu_high_school_government_and_politics": 0,
+ "openaimmlu_high_school_macroeconomics": 0,
+ "openaimmlu_high_school_mathematics": 0,
+ "openaimmlu_high_school_microeconomics": 0,
+ "openaimmlu_high_school_physics": 0,
+ "openaimmlu_high_school_psychology": 0,
+ "openaimmlu_high_school_statistics": 0,
+ "openaimmlu_high_school_us_history": 0,
+ "openaimmlu_high_school_world_history": 0,
+ "openaimmlu_human_aging": 0,
+ "openaimmlu_human_sexuality": 0,
+ "openaimmlu_international_law": 0,
+ "openaimmlu_jurisprudence": 0,
+ "openaimmlu_logical_fallacies": 0,
+ "openaimmlu_machine_learning": 0,
+ "openaimmlu_management": 0,
+ "openaimmlu_marketing": 0,
+ "openaimmlu_medical_genetics": 0,
+ "openaimmlu_miscellaneous": 0,
+ "openaimmlu_moral_disputes": 0,
+ "openaimmlu_moral_scenarios": 0,
+ "openaimmlu_nutrition": 0,
+ "openaimmlu_philosophy": 0,
+ "openaimmlu_prehistory": 0,
+ "openaimmlu_professional_accounting": 0,
+ "openaimmlu_professional_law": 0,
+ "openaimmlu_professional_medicine": 0,
+ "openaimmlu_professional_psychology": 0,
+ "openaimmlu_public_relations": 0,
+ "openaimmlu_security_studies": 0,
+ "openaimmlu_sociology": 0,
+ "openaimmlu_us_foreign_policy": 0,
+ "openaimmlu_virology": 0,
+ "openaimmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "openaimmlu": {
+ "acc": true
+ },
+ "openaimmlu_STEM": {
+ "acc": true
+ },
+ "openaimmlu_abstract_algebra": {
+ "acc": true
+ },
+ "openaimmlu_anatomy": {
+ "acc": true
+ },
+ "openaimmlu_astronomy": {
+ "acc": true
+ },
+ "openaimmlu_business_ethics": {
+ "acc": true
+ },
+ "openaimmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "openaimmlu_college_biology": {
+ "acc": true
+ },
+ "openaimmlu_college_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_college_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_college_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_college_medicine": {
+ "acc": true
+ },
+ "openaimmlu_college_physics": {
+ "acc": true
+ },
+ "openaimmlu_computer_security": {
+ "acc": true
+ },
+ "openaimmlu_conceptual_physics": {
+ "acc": true
+ },
+ "openaimmlu_econometrics": {
+ "acc": true
+ },
+ "openaimmlu_electrical_engineering": {
+ "acc": true
+ },
+ "openaimmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_formal_logic": {
+ "acc": true
+ },
+ "openaimmlu_global_facts": {
+ "acc": true
+ },
+ "openaimmlu_high_school_biology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "openaimmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "openaimmlu_high_school_european_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_geography": {
+ "acc": true
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_physics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_psychology": {
+ "acc": true
+ },
+ "openaimmlu_high_school_statistics": {
+ "acc": true
+ },
+ "openaimmlu_high_school_us_history": {
+ "acc": true
+ },
+ "openaimmlu_high_school_world_history": {
+ "acc": true
+ },
+ "openaimmlu_human_aging": {
+ "acc": true
+ },
+ "openaimmlu_human_sexuality": {
+ "acc": true
+ },
+ "openaimmlu_humanities": {
+ "acc": true
+ },
+ "openaimmlu_international_law": {
+ "acc": true
+ },
+ "openaimmlu_jurisprudence": {
+ "acc": true
+ },
+ "openaimmlu_logical_fallacies": {
+ "acc": true
+ },
+ "openaimmlu_machine_learning": {
+ "acc": true
+ },
+ "openaimmlu_management": {
+ "acc": true
+ },
+ "openaimmlu_marketing": {
+ "acc": true
+ },
+ "openaimmlu_medical_genetics": {
+ "acc": true
+ },
+ "openaimmlu_miscellaneous": {
+ "acc": true
+ },
+ "openaimmlu_moral_disputes": {
+ "acc": true
+ },
+ "openaimmlu_moral_scenarios": {
+ "acc": true
+ },
+ "openaimmlu_nutrition": {
+ "acc": true
+ },
+ "openaimmlu_other": {
+ "acc": true
+ },
+ "openaimmlu_philosophy": {
+ "acc": true
+ },
+ "openaimmlu_prehistory": {
+ "acc": true
+ },
+ "openaimmlu_professional_accounting": {
+ "acc": true
+ },
+ "openaimmlu_professional_law": {
+ "acc": true
+ },
+ "openaimmlu_professional_medicine": {
+ "acc": true
+ },
+ "openaimmlu_professional_psychology": {
+ "acc": true
+ },
+ "openaimmlu_public_relations": {
+ "acc": true
+ },
+ "openaimmlu_security_studies": {
+ "acc": true
+ },
+ "openaimmlu_social_science": {
+ "acc": true
+ },
+ "openaimmlu_sociology": {
+ "acc": true
+ },
+ "openaimmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "openaimmlu_virology": {
+ "acc": true
+ },
+ "openaimmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "openaimmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "openaimmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "openaimmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "openaimmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "openaimmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "openaimmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "openaimmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "openaimmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "openaimmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "openaimmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "openaimmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "openaimmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "openaimmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "openaimmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "openaimmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "openaimmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "openaimmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "openaimmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "openaimmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "openaimmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "openaimmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "openaimmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "openaimmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "openaimmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "openaimmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "openaimmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "openaimmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "openaimmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "openaimmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "openaimmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "openaimmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "openaimmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "openaimmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "openaimmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "openaimmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "openaimmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "openaimmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "openaimmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "openaimmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "openaimmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "openaimmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "openaimmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "openaimmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "openaimmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "openaimmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "openaimmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "openaimmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "openaimmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=inceptionai/jais-family-6p7b-chat,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 6794562592,
+ "model_dtype": "torch.float32",
+ "model_revision": "main",
+ "model_sha": "683805efe6126c6536feb4aa23317e70222ac94c",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1731589359.4289489,
+    "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.38.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "tokenizer_bos_token": [
+ "<|endoftext|>",
+ "0"
+ ],
+ "eot_token_id": 0,
+ "max_length": 2048,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "inceptionai/jais-family-6p7b-chat",
+ "model_name_sanitized": "inceptionai__jais-family-6p7b-chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 68968.971515221,
+ "end_time": 70365.041215983,
+ "total_evaluation_time_seconds": "1396.0697007620038"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/agieval_0_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/agieval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..8ff0189bff4782a65d17b0ffa62b16a85aff98c2
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/agieval_0_shot.json
@@ -0,0 +1,1136 @@
+{
+ "results": {
+ "agieval": {
+ "acc,none": 0.5601112723754234,
+ "acc_stderr,none": 0.004693470405808621,
+ "alias": "agieval"
+ },
+ "agieval_aqua_rat": {
+ "alias": " - agieval_aqua_rat",
+ "acc,none": 0.33070866141732286,
+ "acc_stderr,none": 0.029578090029714014,
+ "acc_norm,none": 0.30708661417322836,
+ "acc_norm_stderr,none": 0.029000778616292126
+ },
+ "agieval_gaokao_biology": {
+ "alias": " - agieval_gaokao_biology",
+ "acc,none": 0.8666666666666667,
+ "acc_stderr,none": 0.02351377032724985,
+ "acc_norm,none": 0.7714285714285715,
+ "acc_norm_stderr,none": 0.029045956871566577
+ },
+ "agieval_gaokao_chemistry": {
+ "alias": " - agieval_gaokao_chemistry",
+ "acc,none": 0.6618357487922706,
+ "acc_stderr,none": 0.03296137710480074,
+ "acc_norm,none": 0.4927536231884058,
+ "acc_norm_stderr,none": 0.03483299197900242
+ },
+ "agieval_gaokao_chinese": {
+ "alias": " - agieval_gaokao_chinese",
+ "acc,none": 0.6747967479674797,
+ "acc_stderr,none": 0.029928220038850487,
+ "acc_norm,none": 0.6707317073170732,
+ "acc_norm_stderr,none": 0.030023846584693495
+ },
+ "agieval_gaokao_english": {
+ "alias": " - agieval_gaokao_english",
+ "acc,none": 0.7647058823529411,
+ "acc_stderr,none": 0.02428861946604611,
+ "acc_norm,none": 0.7679738562091504,
+ "acc_norm_stderr,none": 0.024170840879340873
+ },
+ "agieval_gaokao_geography": {
+ "alias": " - agieval_gaokao_geography",
+ "acc,none": 0.8442211055276382,
+ "acc_stderr,none": 0.025772100500124857,
+ "acc_norm,none": 0.8442211055276382,
+ "acc_norm_stderr,none": 0.02577210050012485
+ },
+ "agieval_gaokao_history": {
+ "alias": " - agieval_gaokao_history",
+ "acc,none": 0.9319148936170213,
+ "acc_stderr,none": 0.01646668803483987,
+ "acc_norm,none": 0.9319148936170213,
+ "acc_norm_stderr,none": 0.01646668803483987
+ },
+ "agieval_gaokao_mathcloze": {
+ "alias": " - agieval_gaokao_mathcloze",
+ "acc,none": 0.11864406779661017,
+ "acc_stderr,none": 0.029895495040277886
+ },
+ "agieval_gaokao_mathqa": {
+ "alias": " - agieval_gaokao_mathqa",
+ "acc,none": 0.5612535612535613,
+ "acc_stderr,none": 0.026524813247424218,
+ "acc_norm,none": 0.5270655270655271,
+ "acc_norm_stderr,none": 0.026686939408346523
+ },
+ "agieval_gaokao_physics": {
+ "alias": " - agieval_gaokao_physics",
+ "acc,none": 0.8,
+ "acc_stderr,none": 0.028355248200333395,
+ "acc_norm,none": 0.725,
+ "acc_norm_stderr,none": 0.031652557907861936
+ },
+ "agieval_jec_qa_ca": {
+ "alias": " - agieval_jec_qa_ca",
+ "acc,none": 0.7587587587587588,
+ "acc_stderr,none": 0.013542921627849112,
+ "acc_norm,none": 0.6666666666666666,
+ "acc_norm_stderr,none": 0.014922049367861618
+ },
+ "agieval_jec_qa_kd": {
+ "alias": " - agieval_jec_qa_kd",
+ "acc,none": 0.801,
+ "acc_stderr,none": 0.012631649083099184,
+ "acc_norm,none": 0.724,
+ "acc_norm_stderr,none": 0.014142984975740668
+ },
+ "agieval_logiqa_en": {
+ "alias": " - agieval_logiqa_en",
+ "acc,none": 0.41781874039938555,
+ "acc_stderr,none": 0.01934489559271411,
+ "acc_norm,none": 0.4254992319508449,
+ "acc_norm_stderr,none": 0.01939268837474924
+ },
+ "agieval_logiqa_zh": {
+ "alias": " - agieval_logiqa_zh",
+ "acc,none": 0.38556067588325654,
+ "acc_stderr,none": 0.019091022501354762,
+ "acc_norm,none": 0.41321044546851,
+ "acc_norm_stderr,none": 0.01931390783165284
+ },
+ "agieval_lsat_ar": {
+ "alias": " - agieval_lsat_ar",
+ "acc,none": 0.2782608695652174,
+ "acc_stderr,none": 0.029614094221633733,
+ "acc_norm,none": 0.3,
+ "acc_norm_stderr,none": 0.030282512572202356
+ },
+ "agieval_lsat_lr": {
+ "alias": " - agieval_lsat_lr",
+ "acc,none": 0.5823529411764706,
+ "acc_stderr,none": 0.021859436336153615,
+ "acc_norm,none": 0.5509803921568628,
+ "acc_norm_stderr,none": 0.022046610724356357
+ },
+ "agieval_lsat_rc": {
+ "alias": " - agieval_lsat_rc",
+ "acc,none": 0.654275092936803,
+ "acc_stderr,none": 0.029052140190085934,
+ "acc_norm,none": 0.5836431226765799,
+ "acc_norm_stderr,none": 0.03011196940753653
+ },
+ "agieval_math": {
+ "alias": " - agieval_math",
+ "acc,none": 0.078,
+ "acc_stderr,none": 0.008484573530118588
+ },
+ "agieval_sat_en": {
+ "alias": " - agieval_sat_en",
+ "acc,none": 0.8106796116504854,
+ "acc_stderr,none": 0.027361908621979958,
+ "acc_norm,none": 0.7669902912621359,
+ "acc_norm_stderr,none": 0.029526026912337827
+ },
+ "agieval_sat_en_without_passage": {
+ "alias": " - agieval_sat_en_without_passage",
+ "acc,none": 0.4223300970873786,
+ "acc_stderr,none": 0.03449760586825819,
+ "acc_norm,none": 0.4320388349514563,
+ "acc_norm_stderr,none": 0.0345974255383149
+ },
+ "agieval_sat_math": {
+ "alias": " - agieval_sat_math",
+ "acc,none": 0.5409090909090909,
+ "acc_stderr,none": 0.03367359074425883,
+ "acc_norm,none": 0.4636363636363636,
+ "acc_norm_stderr,none": 0.03369739674987932
+ }
+ },
+ "groups": {
+ "agieval": {
+ "acc,none": 0.5601112723754234,
+ "acc_stderr,none": 0.004693470405808621,
+ "alias": "agieval"
+ }
+ },
+ "group_subtasks": {
+ "agieval": [
+ "agieval_gaokao_biology",
+ "agieval_gaokao_chemistry",
+ "agieval_gaokao_chinese",
+ "agieval_gaokao_geography",
+ "agieval_gaokao_history",
+ "agieval_gaokao_mathcloze",
+ "agieval_gaokao_mathqa",
+ "agieval_gaokao_physics",
+ "agieval_jec_qa_ca",
+ "agieval_jec_qa_kd",
+ "agieval_logiqa_zh",
+ "agieval_aqua_rat",
+ "agieval_gaokao_english",
+ "agieval_logiqa_en",
+ "agieval_lsat_ar",
+ "agieval_lsat_lr",
+ "agieval_lsat_rc",
+ "agieval_math",
+ "agieval_sat_en_without_passage",
+ "agieval_sat_en",
+ "agieval_sat_math"
+ ]
+ },
+ "configs": {
+ "agieval_aqua_rat": {
+ "task": "agieval_aqua_rat",
+ "dataset_path": "hails/agieval-aqua-rat",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_biology": {
+ "task": "agieval_gaokao_biology",
+ "dataset_path": "hails/agieval-gaokao-biology",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chemistry": {
+ "task": "agieval_gaokao_chemistry",
+ "dataset_path": "hails/agieval-gaokao-chemistry",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chinese": {
+ "task": "agieval_gaokao_chinese",
+ "dataset_path": "hails/agieval-gaokao-chinese",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_english": {
+ "task": "agieval_gaokao_english",
+ "dataset_path": "hails/agieval-gaokao-english",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_geography": {
+ "task": "agieval_gaokao_geography",
+ "dataset_path": "hails/agieval-gaokao-geography",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_history": {
+ "task": "agieval_gaokao_history",
+ "dataset_path": "hails/agieval-gaokao-history",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathcloze": {
+ "task": "agieval_gaokao_mathcloze",
+ "dataset_path": "hails/agieval-gaokao-mathcloze",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathqa": {
+ "task": "agieval_gaokao_mathqa",
+ "dataset_path": "hails/agieval-gaokao-mathqa",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_physics": {
+ "task": "agieval_gaokao_physics",
+ "dataset_path": "hails/agieval-gaokao-physics",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_ca": {
+ "task": "agieval_jec_qa_ca",
+ "dataset_path": "hails/agieval-jec-qa-ca",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_kd": {
+ "task": "agieval_jec_qa_kd",
+ "dataset_path": "hails/agieval-jec-qa-kd",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_en": {
+ "task": "agieval_logiqa_en",
+ "dataset_path": "hails/agieval-logiqa-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_zh": {
+ "task": "agieval_logiqa_zh",
+ "dataset_path": "hails/agieval-logiqa-zh",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_ar": {
+ "task": "agieval_lsat_ar",
+ "dataset_path": "hails/agieval-lsat-ar",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_lr": {
+ "task": "agieval_lsat_lr",
+ "dataset_path": "hails/agieval-lsat-lr",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_rc": {
+ "task": "agieval_lsat_rc",
+ "dataset_path": "hails/agieval-lsat-rc",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_math": {
+ "task": "agieval_math",
+ "dataset_path": "hails/agieval-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en": {
+ "task": "agieval_sat_en",
+ "dataset_path": "hails/agieval-sat-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en_without_passage": {
+ "task": "agieval_sat_en_without_passage",
+ "dataset_path": "hails/agieval-sat-en-without-passage",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_math": {
+ "task": "agieval_sat_math",
+ "dataset_path": "hails/agieval-sat-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "agieval": 0.0,
+ "agieval_aqua_rat": 1.0,
+ "agieval_gaokao_biology": 1.0,
+ "agieval_gaokao_chemistry": 1.0,
+ "agieval_gaokao_chinese": 1.0,
+ "agieval_gaokao_english": 1.0,
+ "agieval_gaokao_geography": 1.0,
+ "agieval_gaokao_history": 1.0,
+ "agieval_gaokao_mathcloze": 1.0,
+ "agieval_gaokao_mathqa": 1.0,
+ "agieval_gaokao_physics": 1.0,
+ "agieval_jec_qa_ca": 1.0,
+ "agieval_jec_qa_kd": 1.0,
+ "agieval_logiqa_en": 1.0,
+ "agieval_logiqa_zh": 1.0,
+ "agieval_lsat_ar": 1.0,
+ "agieval_lsat_lr": 1.0,
+ "agieval_lsat_rc": 1.0,
+ "agieval_math": 1.0,
+ "agieval_sat_en": 1.0,
+ "agieval_sat_en_without_passage": 1.0,
+ "agieval_sat_math": 1.0
+ },
+ "n-shot": {
+ "agieval_aqua_rat": 0,
+ "agieval_gaokao_biology": 0,
+ "agieval_gaokao_chemistry": 0,
+ "agieval_gaokao_chinese": 0,
+ "agieval_gaokao_english": 0,
+ "agieval_gaokao_geography": 0,
+ "agieval_gaokao_history": 0,
+ "agieval_gaokao_mathcloze": 0,
+ "agieval_gaokao_mathqa": 0,
+ "agieval_gaokao_physics": 0,
+ "agieval_jec_qa_ca": 0,
+ "agieval_jec_qa_kd": 0,
+ "agieval_logiqa_en": 0,
+ "agieval_logiqa_zh": 0,
+ "agieval_lsat_ar": 0,
+ "agieval_lsat_lr": 0,
+ "agieval_lsat_rc": 0,
+ "agieval_math": 0,
+ "agieval_sat_en": 0,
+ "agieval_sat_en_without_passage": 0,
+ "agieval_sat_math": 0
+ },
+ "higher_is_better": {
+ "agieval": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_aqua_rat": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_biology": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chemistry": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chinese": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_english": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_geography": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_history": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_mathcloze": {
+ "acc": true
+ },
+ "agieval_gaokao_mathqa": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_physics": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_ca": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_kd": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_zh": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_ar": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_lr": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_rc": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_math": {
+ "acc": true
+ },
+ "agieval_sat_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_en_without_passage": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_math": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "agieval_gaokao_biology": {
+ "original": 210,
+ "effective": 210
+ },
+ "agieval_gaokao_chemistry": {
+ "original": 207,
+ "effective": 207
+ },
+ "agieval_gaokao_chinese": {
+ "original": 246,
+ "effective": 246
+ },
+ "agieval_gaokao_geography": {
+ "original": 199,
+ "effective": 199
+ },
+ "agieval_gaokao_history": {
+ "original": 235,
+ "effective": 235
+ },
+ "agieval_gaokao_mathcloze": {
+ "original": 118,
+ "effective": 118
+ },
+ "agieval_gaokao_mathqa": {
+ "original": 351,
+ "effective": 351
+ },
+ "agieval_gaokao_physics": {
+ "original": 200,
+ "effective": 200
+ },
+ "agieval_jec_qa_ca": {
+ "original": 999,
+ "effective": 999
+ },
+ "agieval_jec_qa_kd": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_logiqa_zh": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_aqua_rat": {
+ "original": 254,
+ "effective": 254
+ },
+ "agieval_gaokao_english": {
+ "original": 306,
+ "effective": 306
+ },
+ "agieval_logiqa_en": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_lsat_ar": {
+ "original": 230,
+ "effective": 230
+ },
+ "agieval_lsat_lr": {
+ "original": 510,
+ "effective": 510
+ },
+ "agieval_lsat_rc": {
+ "original": 269,
+ "effective": 269
+ },
+ "agieval_math": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_sat_en_without_passage": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_en": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_math": {
+ "original": 220,
+ "effective": 220
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 8
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737968090.6750762,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "agieval_gaokao_biology": "48856850a9c3cb2bdd072c002e182cf4dc1270c513df1b196c07cd50c35ee312",
+ "agieval_gaokao_chemistry": "298b30fddb559f13b752f13e9d5df9870ed193e55d393fa75daabc989f6d14a2",
+ "agieval_gaokao_chinese": "dbde0aa44b028bf2ae28c3e3bd3eb4b5c76a1c9e335b93377719aeae0f385089",
+ "agieval_gaokao_geography": "0f6315ed900034917ccc6a2a7e8af396ac5450984f5d2995966f4e6d944ddca7",
+ "agieval_gaokao_history": "477fc7b6346abd5e6d7899fbdf17f9b6480fcee718412afe23efcf7d2b467c99",
+ "agieval_gaokao_mathcloze": "e7d869494f25d82eb72aae9a978c044d2dd05456eb59288f5396caa2e976c37c",
+ "agieval_gaokao_mathqa": "a990d2387b02674e639121eeaf4bf747d0b7950638c0cf305818e1e7307271cd",
+ "agieval_gaokao_physics": "b35f0e58df73200a0b4bd485904fa2f31ddcbdb906d62166a21715a9fec13df6",
+ "agieval_jec_qa_ca": "8ece590313c402549921441fee0b161996f57a073d2562f41dcab194adf3d6e1",
+ "agieval_jec_qa_kd": "f968b31c5a4a5b2e2a309162cc1966ce2d859ae3db467b9bf77aec1dcf3da313",
+ "agieval_logiqa_zh": "e7dfec6cca6c9d836bcf0090fa307a59af484030c0395793b9ef4890dd73dae7",
+ "agieval_aqua_rat": "2186c15644e0585992df4e6090e4cbdc623f814a4725803c9fe053a3c6eee826",
+ "agieval_gaokao_english": "1997a0d2b769dd5690676a55acba44f9655257b3ec335745d4f8b70045941028",
+ "agieval_logiqa_en": "8cbc44ae4163ae2093f88be6eb95327bd0ac1c1aef48c40549bf0769b43aa0de",
+ "agieval_lsat_ar": "d09b7b14ebb5f21bbd602143c8fc62a4edef6a64ab0f6eb87b9aafa7a4426c43",
+ "agieval_lsat_lr": "a5cd32cd2a2759d428ef21fd2e8362276fe0b15dc1fff48fe30f6f39525d1336",
+ "agieval_lsat_rc": "ce4856d4b9eaa4beb1ab1cb0e139f73d4097298e16e06025258b05b3d422b0eb",
+ "agieval_math": "c4edf8986242f57ad6d5c1cb001b194b30d20a60bd6fb0909cb37b5e0d6d5c56",
+ "agieval_sat_en_without_passage": "11bfc5e60248d5acab69f12abac189f630e0b3ad7dc8cdb9db8ccdc040516bb0",
+ "agieval_sat_en": "3bb865c97a1fcec9154b1dbbae2bac428982fb809d8d42bb1ddb83199881c7ac",
+ "agieval_sat_math": "63798581920be3a992f61dab8df71eb75cb455163fca9ea156540d204951c2c2"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1678344.73080511,
+ "end_time": 1683002.034935803,
+ "total_evaluation_time_seconds": "4657.304130692966"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/arc_challenge_0_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/arc_challenge_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..08c07f01a194cfaf2f2adeeb765c91d1d1cf5f18
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/arc_challenge_0_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "arc_challenge": {
+ "alias": "arc_challenge",
+ "acc,none": 0.5179180887372014,
+ "acc_stderr,none": 0.014602005585490971,
+ "acc_norm,none": 0.5392491467576792,
+ "acc_norm_stderr,none": 0.014566303676636586
+ }
+ },
+ "group_subtasks": {
+ "arc_challenge": []
+ },
+ "configs": {
+ "arc_challenge": {
+ "task": "arc_challenge",
+ "tag": [
+ "ai2_arc"
+ ],
+ "dataset_path": "allenai/ai2_arc",
+ "dataset_name": "ARC-Challenge",
+ "training_split": "train",
+ "validation_split": "validation",
+ "test_split": "test",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{choices.label.index(answerKey)}}",
+ "doc_to_choice": "{{choices.text}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question: {{question}}\nAnswer:",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "arc_challenge": 1.0
+ },
+ "n-shot": {
+ "arc_challenge": 0
+ },
+ "higher_is_better": {
+ "arc_challenge": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "arc_challenge": {
+ "original": 1172,
+ "effective": 1172
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737972876.8138564,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "arc_challenge": "09f9ae87a0905d63512cffc4aa91a55e44258fc35160e40fa1eb66fb75473e34"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1683130.71663661,
+ "end_time": 1683230.116914329,
+ "total_evaluation_time_seconds": "99.40027771890163"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/gpqa_main_n_shot_0_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/gpqa_main_n_shot_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..5b59cc33a73040ace3d6aeffbb902442cd13702b
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/gpqa_main_n_shot_0_shot.json
@@ -0,0 +1,125 @@
+{
+ "results": {
+ "gpqa_main_n_shot": {
+ "alias": "gpqa_main_n_shot",
+ "acc,none": 0.328125,
+ "acc_stderr,none": 0.0222080353262888,
+ "acc_norm,none": 0.328125,
+ "acc_norm_stderr,none": 0.0222080353262888
+ }
+ },
+ "group_subtasks": {
+ "gpqa_main_n_shot": []
+ },
+ "configs": {
+ "gpqa_main_n_shot": {
+ "task": "gpqa_main_n_shot",
+ "tag": "gpqa",
+ "dataset_path": "Idavidrein/gpqa",
+ "dataset_name": "gpqa_main",
+ "training_split": "train",
+ "validation_split": "train",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n choices = [\n preprocess(doc[\"Incorrect Answer 1\"]),\n preprocess(doc[\"Incorrect Answer 2\"]),\n preprocess(doc[\"Incorrect Answer 3\"]),\n preprocess(doc[\"Correct Answer\"]),\n ]\n\n rng.shuffle(choices)\n correct_answer_index = choices.index(preprocess(doc[\"Correct Answer\"]))\n\n out_doc = {\n \"choice1\": choices[0],\n \"choice2\": choices[1],\n \"choice3\": choices[2],\n \"choice4\": choices[3],\n \"answer\": f\"({chr(65 + correct_answer_index)})\",\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "(A)",
+ "(B)",
+ "(C)",
+ "(D)"
+ ],
+ "description": "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "gpqa_main_n_shot": 2.0
+ },
+ "n-shot": {
+ "gpqa_main_n_shot": 0
+ },
+ "higher_is_better": {
+ "gpqa_main_n_shot": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "gpqa_main_n_shot": {
+ "original": 448,
+ "effective": 448
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739796947.9720185,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "gpqa_main_n_shot": "a3483bbbe2e4b606b3eccce05ccdbeeebe27c393296c82d64bf645fff6aed3ff"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 413228.20145324,
+ "end_time": 415139.438325981,
+ "total_evaluation_time_seconds": "1911.2368727410212"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/gsm8k_5_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/gsm8k_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..20b9fa86df55a1ea20075249df55730eddfcb183
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/gsm8k_5_shot.json
@@ -0,0 +1,153 @@
+{
+ "results": {
+ "gsm8k": {
+ "alias": "gsm8k",
+ "exact_match,strict-match": 0.7869598180439727,
+ "exact_match_stderr,strict-match": 0.011278447856900771,
+ "exact_match,flexible-extract": 0.7952994692949203,
+ "exact_match_stderr,flexible-extract": 0.011113916396062962
+ }
+ },
+ "group_subtasks": {
+ "gsm8k": []
+ },
+ "configs": {
+ "gsm8k": {
+ "task": "gsm8k",
+ "tag": [
+ "math_word_problems"
+ ],
+ "dataset_path": "gsm8k",
+ "dataset_name": "main",
+ "training_split": "train",
+ "test_split": "test",
+ "fewshot_split": "train",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{answer}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": false,
+ "regexes_to_ignore": [
+ ",",
+ "\\$",
+ "(?s).*#### ",
+ "\\.$"
+ ]
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Question:",
+ "",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "strict-match",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "#### (\\-?[0-9\\.\\,]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ },
+ {
+ "name": "flexible-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "group_select": -1,
+ "regex_pattern": "(-?[$0-9.,]{2,})|(-?[0-9]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 3.0
+ }
+ }
+ },
+ "versions": {
+ "gsm8k": 3.0
+ },
+ "n-shot": {
+ "gsm8k": 5
+ },
+ "higher_is_better": {
+ "gsm8k": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "gsm8k": {
+ "original": 1319,
+ "effective": 1319
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,tensor_parallel_size=2,data_parallel_size=4,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737583211.3834355,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 111293.791044811,
+ "end_time": 111435.003001496,
+ "total_evaluation_time_seconds": "141.2119566850015"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/hellaswag_0_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/hellaswag_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..42ded91fd66d55607db7773d646ac019d0214599
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/hellaswag_0_shot.json
@@ -0,0 +1,126 @@
+{
+ "results": {
+ "hellaswag": {
+ "alias": "hellaswag",
+ "acc,none": 0.645488946425015,
+ "acc_stderr,none": 0.004773872456201065,
+ "acc_norm,none": 0.8329018123879706,
+ "acc_norm_stderr,none": 0.0037230107458785114
+ }
+ },
+ "group_subtasks": {
+ "hellaswag": []
+ },
+ "configs": {
+ "hellaswag": {
+ "task": "hellaswag",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "hellaswag",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "validation_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n ctx = doc[\"ctx_a\"] + \" \" + doc[\"ctx_b\"].capitalize()\n out_doc = {\n \"query\": preprocess(doc[\"activity_label\"] + \": \" + ctx),\n \"choices\": [preprocess(ending) for ending in doc[\"endings\"]],\n \"gold\": int(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "hellaswag": 1.0
+ },
+ "n-shot": {
+ "hellaswag": 0
+ },
+ "higher_is_better": {
+ "hellaswag": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "hellaswag": {
+ "original": 10042,
+ "effective": 10042
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737896278.0364246,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "hellaswag": "f3c11b39766a06b6c303d8176d8f35fc9c3026e524aee7b9aaa946c35951cde8"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 6712.201821225,
+ "end_time": 7280.43429144,
+ "total_evaluation_time_seconds": "568.2324702150008"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/hendrycks_ethics_0_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/hendrycks_ethics_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b3ba394120c6cb22faeccb3013672f35ea06f87f
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/hendrycks_ethics_0_shot.json
@@ -0,0 +1,319 @@
+{
+ "results": {
+ "ethics_cm": {
+ "alias": "ethics_cm",
+ "acc,none": 0.6149292149292149,
+ "acc_stderr,none": 0.00780806172478048
+ },
+ "ethics_deontology": {
+ "alias": "ethics_deontology",
+ "acc,none": 0.5433815350389322,
+ "acc_stderr,none": 0.00830767934735274
+ },
+ "ethics_justice": {
+ "alias": "ethics_justice",
+ "acc,none": 0.6368343195266272,
+ "acc_stderr,none": 0.009250018627925967
+ },
+ "ethics_utilitarianism": {
+ "alias": "ethics_utilitarianism",
+ "acc,none": 0.6283277870216306,
+ "acc_stderr,none": 0.006970053615681693
+ },
+ "ethics_virtue": {
+ "alias": "ethics_virtue",
+ "acc,none": 0.8878391959798995,
+ "acc_stderr,none": 0.004474400177505811
+ }
+ },
+ "group_subtasks": {
+ "ethics_deontology": [],
+ "ethics_virtue": [],
+ "ethics_cm": [],
+ "ethics_utilitarianism": [],
+ "ethics_justice": []
+ },
+ "configs": {
+ "ethics_cm": {
+ "task": "ethics_cm",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "commonsense",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "{{input}}\nQuestion: Is this wrong?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_deontology": {
+ "task": "ethics_deontology",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "deontology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}} {{excuse.rstrip()}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_justice": {
+ "task": "ethics_justice",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "justice",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_utilitarianism": {
+ "task": "ethics_utilitarianism",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "utilitarianism",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "def doc_to_text(doc) -> str:\n doc = _preproc_doc(doc)\n return f\"Scenario 1: {doc['scenarios'][0]}\\nScenario 2: {doc['scenarios'][1]}\\nQuestion: Is Scenario 1 preferable?\\nAnswer:\"\n",
+ "doc_to_target": "def doc_to_target(doc):\n doc = _preproc_doc(doc)\n return doc[\"label\"]\n",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_virtue": {
+ "task": "ethics_virtue",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "virtue",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Sentence: {{scenario}}\nQuestion: Does the character in this sentence exhibit the trait \"{{trait}}\"?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "ethics_cm": 1.0,
+ "ethics_deontology": 1.0,
+ "ethics_justice": 1.0,
+ "ethics_utilitarianism": 1.0,
+ "ethics_virtue": 1.0
+ },
+ "n-shot": {
+ "ethics_cm": 0,
+ "ethics_deontology": 0,
+ "ethics_justice": 0,
+ "ethics_utilitarianism": 0,
+ "ethics_virtue": 0
+ },
+ "higher_is_better": {
+ "ethics_cm": {
+ "acc": true
+ },
+ "ethics_deontology": {
+ "acc": true
+ },
+ "ethics_justice": {
+ "acc": true
+ },
+ "ethics_utilitarianism": {
+ "acc": true
+ },
+ "ethics_virtue": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "ethics_justice": {
+ "original": 2704,
+ "effective": 2704
+ },
+ "ethics_utilitarianism": {
+ "original": 4808,
+ "effective": 4808
+ },
+ "ethics_cm": {
+ "original": 3885,
+ "effective": 3885
+ },
+ "ethics_virtue": {
+ "original": 4975,
+ "effective": 4975
+ },
+ "ethics_deontology": {
+ "original": 3596,
+ "effective": 3596
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 8
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737973124.5927782,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "ethics_justice": "29e70305fd625a6fa42aa154ef0c4fcd7ffbfce91483485d61ef01ebaab02235",
+ "ethics_utilitarianism": "50e3b75384c265c6c5fb9691f46a46b22a44ffb07d131e285b5f0a84b1025bc8",
+ "ethics_cm": "088ead6c08bb523b9de2bf5098b07ad2d484b8d19d068937634e20e4a776db84",
+ "ethics_virtue": "b3e6efc9b8e5a591f9e9bd96c14a97d118c29455f4441e52d97b10b404513a55",
+ "ethics_deontology": "5311ba877c2291b107da9263731e4895484636a7fdce77b31855eb34cc6c2a37"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1683378.388609929,
+ "end_time": 1683984.191104153,
+ "total_evaluation_time_seconds": "605.8024942239281"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/ifeval_0_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f21a301c5dc7ef95cf4ffee6aef8cd9f5f04983b
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/ifeval_0_shot.json
@@ -0,0 +1,132 @@
+{
+ "results": {
+ "ifeval": {
+ "alias": "ifeval",
+ "prompt_level_strict_acc,none": 0.2754158964879852,
+ "prompt_level_strict_acc_stderr,none": 0.019223923196242006,
+ "inst_level_strict_acc,none": 0.4088729016786571,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.3364140480591497,
+ "prompt_level_loose_acc_stderr,none": 0.020332406004701264,
+ "inst_level_loose_acc,none": 0.46882494004796166,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ifeval": []
+ },
+ "configs": {
+ "ifeval": {
+ "task": "ifeval",
+ "dataset_path": "google/IFEval",
+ "test_split": "train",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n inp = InputExample(\n key=doc[\"key\"],\n instruction_id_list=doc[\"instruction_id_list\"],\n prompt=doc[\"prompt\"],\n kwargs=doc[\"kwargs\"],\n )\n response = results[0]\n\n out_strict = test_instruction_following_strict(inp, response)\n out_loose = test_instruction_following_loose(inp, response)\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ifeval": 4.0
+ },
+ "n-shot": {
+ "ifeval": 0
+ },
+ "higher_is_better": {
+ "ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ifeval": {
+ "original": 541,
+ "effective": 541
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,tensor_parallel_size=2,data_parallel_size=4,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737582090.0582705,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 110172.444165653,
+ "end_time": 110319.072051442,
+ "total_evaluation_time_seconds": "146.62788578899927"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/minerva_math_4_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/minerva_math_4_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..1cd2f2e13ab712bc751ca16913df397054cb01f6
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/minerva_math_4_shot.json
@@ -0,0 +1,521 @@
+{
+ "results": {
+ "minerva_math": {
+ "exact_match,none": 0.328,
+ "exact_match_stderr,none": 0.006239030429451531,
+ "alias": "minerva_math"
+ },
+ "minerva_math_algebra": {
+ "alias": " - minerva_math_algebra",
+ "exact_match,none": 0.4818871103622578,
+ "exact_match_stderr,none": 0.014509167981143361
+ },
+ "minerva_math_counting_and_prob": {
+ "alias": " - minerva_math_counting_and_prob",
+ "exact_match,none": 0.2911392405063291,
+ "exact_match_stderr,none": 0.020888164059267196
+ },
+ "minerva_math_geometry": {
+ "alias": " - minerva_math_geometry",
+ "exact_match,none": 0.2651356993736952,
+ "exact_match_stderr,none": 0.02018941478172901
+ },
+ "minerva_math_intermediate_algebra": {
+ "alias": " - minerva_math_intermediate_algebra",
+ "exact_match,none": 0.14396456256921372,
+ "exact_match_stderr,none": 0.011688812818875677
+ },
+ "minerva_math_num_theory": {
+ "alias": " - minerva_math_num_theory",
+ "exact_match,none": 0.2111111111111111,
+ "exact_match_stderr,none": 0.017577984727516007
+ },
+ "minerva_math_prealgebra": {
+ "alias": " - minerva_math_prealgebra",
+ "exact_match,none": 0.5510907003444316,
+ "exact_match_stderr,none": 0.01686285928831101
+ },
+ "minerva_math_precalc": {
+ "alias": " - minerva_math_precalc",
+ "exact_match,none": 0.1446886446886447,
+ "exact_match_stderr,none": 0.015068884082729252
+ }
+ },
+ "groups": {
+ "minerva_math": {
+ "exact_match,none": 0.328,
+ "exact_match_stderr,none": 0.006239030429451531,
+ "alias": "minerva_math"
+ }
+ },
+ "group_subtasks": {
+ "minerva_math": [
+ "minerva_math_algebra",
+ "minerva_math_counting_and_prob",
+ "minerva_math_geometry",
+ "minerva_math_intermediate_algebra",
+ "minerva_math_num_theory",
+ "minerva_math_prealgebra",
+ "minerva_math_precalc"
+ ]
+ },
+ "configs": {
+ "minerva_math_algebra": {
+ "task": "minerva_math_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_counting_and_prob": {
+ "task": "minerva_math_counting_and_prob",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "counting_and_probability",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_geometry": {
+ "task": "minerva_math_geometry",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_intermediate_algebra": {
+ "task": "minerva_math_intermediate_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "intermediate_algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_num_theory": {
+ "task": "minerva_math_num_theory",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "number_theory",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_prealgebra": {
+ "task": "minerva_math_prealgebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "prealgebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_precalc": {
+ "task": "minerva_math_precalc",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "precalculus",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "minerva_math": 1.0,
+ "minerva_math_algebra": 1.0,
+ "minerva_math_counting_and_prob": 1.0,
+ "minerva_math_geometry": 1.0,
+ "minerva_math_intermediate_algebra": 1.0,
+ "minerva_math_num_theory": 1.0,
+ "minerva_math_prealgebra": 1.0,
+ "minerva_math_precalc": 1.0
+ },
+ "n-shot": {
+ "minerva_math_algebra": 4,
+ "minerva_math_counting_and_prob": 4,
+ "minerva_math_geometry": 4,
+ "minerva_math_intermediate_algebra": 4,
+ "minerva_math_num_theory": 4,
+ "minerva_math_prealgebra": 4,
+ "minerva_math_precalc": 4
+ },
+ "higher_is_better": {
+ "minerva_math": {
+ "exact_match": true
+ },
+ "minerva_math_algebra": {
+ "exact_match": true
+ },
+ "minerva_math_counting_and_prob": {
+ "exact_match": true
+ },
+ "minerva_math_geometry": {
+ "exact_match": true
+ },
+ "minerva_math_intermediate_algebra": {
+ "exact_match": true
+ },
+ "minerva_math_num_theory": {
+ "exact_match": true
+ },
+ "minerva_math_prealgebra": {
+ "exact_match": true
+ },
+ "minerva_math_precalc": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "minerva_math_algebra": {
+ "original": 1187,
+ "effective": 1187
+ },
+ "minerva_math_counting_and_prob": {
+ "original": 474,
+ "effective": 474
+ },
+ "minerva_math_geometry": {
+ "original": 479,
+ "effective": 479
+ },
+ "minerva_math_intermediate_algebra": {
+ "original": 903,
+ "effective": 903
+ },
+ "minerva_math_num_theory": {
+ "original": 540,
+ "effective": 540
+ },
+ "minerva_math_prealgebra": {
+ "original": 871,
+ "effective": 871
+ },
+ "minerva_math_precalc": {
+ "original": 546,
+ "effective": 546
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,tensor_parallel_size=2,data_parallel_size=4,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737581383.6780143,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 109466.080707565,
+ "end_time": 109890.218887646,
+ "total_evaluation_time_seconds": "424.138180081005"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/mmlu_0_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/mmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..e801508819439eb47619689015ef303c4034a1ce
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/mmlu_0_shot.json
@@ -0,0 +1,3289 @@
+{
+ "results": {
+ "mmlu": {
+ "acc,none": 0.7402791625124626,
+ "acc_stderr,none": 0.003524911001629346,
+ "alias": "mmlu"
+ },
+ "mmlu_humanities": {
+ "acc,none": 0.6981934112646121,
+ "acc_stderr,none": 0.006407716322113214,
+ "alias": " - humanities"
+ },
+ "mmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.5555555555555556,
+ "acc_stderr,none": 0.044444444444444495
+ },
+ "mmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.8727272727272727,
+ "acc_stderr,none": 0.026024657651656204
+ },
+ "mmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.9068627450980392,
+ "acc_stderr,none": 0.02039785396942699
+ },
+ "mmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.9071729957805907,
+ "acc_stderr,none": 0.01888975055095672
+ },
+ "mmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.859504132231405,
+ "acc_stderr,none": 0.03172233426002158
+ },
+ "mmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.8055555555555556,
+ "acc_stderr,none": 0.03826076324884864
+ },
+ "mmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.7914110429447853,
+ "acc_stderr,none": 0.03192193448934724
+ },
+ "mmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.8034682080924855,
+ "acc_stderr,none": 0.021393961404363854
+ },
+ "mmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.587709497206704,
+ "acc_stderr,none": 0.01646320023811451
+ },
+ "mmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.7877813504823151,
+ "acc_stderr,none": 0.023222756797435126
+ },
+ "mmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.8395061728395061,
+ "acc_stderr,none": 0.020423955354778027
+ },
+ "mmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.5743155149934811,
+ "acc_stderr,none": 0.01262839355181194
+ },
+ "mmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.8713450292397661,
+ "acc_stderr,none": 0.02567934272327692
+ },
+ "mmlu_other": {
+ "acc,none": 0.7804956549726424,
+ "acc_stderr,none": 0.007107644023466694,
+ "alias": " - other"
+ },
+ "mmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.78,
+ "acc_stderr,none": 0.04163331998932261
+ },
+ "mmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.769811320754717,
+ "acc_stderr,none": 0.025907897122408173
+ },
+ "mmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.6994219653179191,
+ "acc_stderr,none": 0.0349610148119118
+ },
+ "mmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.45,
+ "acc_stderr,none": 0.04999999999999998
+ },
+ "mmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.7713004484304933,
+ "acc_stderr,none": 0.028188240046929193
+ },
+ "mmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.8932038834951457,
+ "acc_stderr,none": 0.030581088928331352
+ },
+ "mmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.9102564102564102,
+ "acc_stderr,none": 0.018724301741941632
+ },
+ "mmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.85,
+ "acc_stderr,none": 0.035887028128263734
+ },
+ "mmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.8863346104725415,
+ "acc_stderr,none": 0.01135035905056602
+ },
+ "mmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.8137254901960784,
+ "acc_stderr,none": 0.022292858284568062
+ },
+ "mmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.5815602836879432,
+ "acc_stderr,none": 0.029427994039419987
+ },
+ "mmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.7941176470588235,
+ "acc_stderr,none": 0.02456220431414231
+ },
+ "mmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.5542168674698795,
+ "acc_stderr,none": 0.03869543323472101
+ },
+ "mmlu_social_sciences": {
+ "acc,none": 0.8251543711407214,
+ "acc_stderr,none": 0.0066944381512224534,
+ "alias": " - social sciences"
+ },
+ "mmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.5789473684210527,
+ "acc_stderr,none": 0.04644602091222317
+ },
+ "mmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.898989898989899,
+ "acc_stderr,none": 0.02146973557605533
+ },
+ "mmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.9585492227979274,
+ "acc_stderr,none": 0.014385432857476453
+ },
+ "mmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.782051282051282,
+ "acc_stderr,none": 0.020932445774463185
+ },
+ "mmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.8319327731092437,
+ "acc_stderr,none": 0.024289102115692282
+ },
+ "mmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.8972477064220183,
+ "acc_stderr,none": 0.013018246509173761
+ },
+ "mmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.8320610687022901,
+ "acc_stderr,none": 0.032785485373431386
+ },
+ "mmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.7647058823529411,
+ "acc_stderr,none": 0.01716058723504635
+ },
+ "mmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.7090909090909091,
+ "acc_stderr,none": 0.04350271442923243
+ },
+ "mmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.7795918367346939,
+ "acc_stderr,none": 0.026537045312145294
+ },
+ "mmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.8855721393034826,
+ "acc_stderr,none": 0.022509345325101696
+ },
+ "mmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.94,
+ "acc_stderr,none": 0.023868325657594176
+ },
+ "mmlu_stem": {
+ "acc,none": 0.6806216301934666,
+ "acc_stderr,none": 0.0079547738620017,
+ "alias": " - stem"
+ },
+ "mmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.47,
+ "acc_stderr,none": 0.05016135580465919
+ },
+ "mmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.6814814814814815,
+ "acc_stderr,none": 0.04024778401977108
+ },
+ "mmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.8421052631578947,
+ "acc_stderr,none": 0.02967416752010144
+ },
+ "mmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.875,
+ "acc_stderr,none": 0.02765610492929436
+ },
+ "mmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.56,
+ "acc_stderr,none": 0.049888765156985884
+ },
+ "mmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.59,
+ "acc_stderr,none": 0.04943110704237101
+ },
+ "mmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.5,
+ "acc_stderr,none": 0.050251890762960605
+ },
+ "mmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.46078431372549017,
+ "acc_stderr,none": 0.049598599663841815
+ },
+ "mmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.79,
+ "acc_stderr,none": 0.04093601807403326
+ },
+ "mmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.7531914893617021,
+ "acc_stderr,none": 0.0281854413012341
+ },
+ "mmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.6896551724137931,
+ "acc_stderr,none": 0.03855289616378948
+ },
+ "mmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.6798941798941799,
+ "acc_stderr,none": 0.024026846392873506
+ },
+ "mmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.9161290322580645,
+ "acc_stderr,none": 0.01576902749677563
+ },
+ "mmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.6108374384236454,
+ "acc_stderr,none": 0.03430462416103872
+ },
+ "mmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.86,
+ "acc_stderr,none": 0.034873508801977676
+ },
+ "mmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.4888888888888889,
+ "acc_stderr,none": 0.030478009819615823
+ },
+ "mmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.5960264900662252,
+ "acc_stderr,none": 0.040064856853653415
+ },
+ "mmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.6759259259259259,
+ "acc_stderr,none": 0.03191923445686185
+ },
+ "mmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.5892857142857143,
+ "acc_stderr,none": 0.04669510663875191
+ }
+ },
+ "groups": {
+ "mmlu": {
+ "acc,none": 0.7402791625124626,
+ "acc_stderr,none": 0.003524911001629346,
+ "alias": "mmlu"
+ },
+ "mmlu_humanities": {
+ "acc,none": 0.6981934112646121,
+ "acc_stderr,none": 0.006407716322113214,
+ "alias": " - humanities"
+ },
+ "mmlu_other": {
+ "acc,none": 0.7804956549726424,
+ "acc_stderr,none": 0.007107644023466694,
+ "alias": " - other"
+ },
+ "mmlu_social_sciences": {
+ "acc,none": 0.8251543711407214,
+ "acc_stderr,none": 0.0066944381512224534,
+ "alias": " - social sciences"
+ },
+ "mmlu_stem": {
+ "acc,none": 0.6806216301934666,
+ "acc_stderr,none": 0.0079547738620017,
+ "alias": " - stem"
+ }
+ },
+ "group_subtasks": {
+ "mmlu_humanities": [
+ "mmlu_moral_scenarios",
+ "mmlu_high_school_european_history",
+ "mmlu_jurisprudence",
+ "mmlu_formal_logic",
+ "mmlu_moral_disputes",
+ "mmlu_prehistory",
+ "mmlu_professional_law",
+ "mmlu_philosophy",
+ "mmlu_high_school_world_history",
+ "mmlu_high_school_us_history",
+ "mmlu_logical_fallacies",
+ "mmlu_world_religions",
+ "mmlu_international_law"
+ ],
+ "mmlu_social_sciences": [
+ "mmlu_public_relations",
+ "mmlu_high_school_government_and_politics",
+ "mmlu_security_studies",
+ "mmlu_econometrics",
+ "mmlu_high_school_geography",
+ "mmlu_high_school_macroeconomics",
+ "mmlu_sociology",
+ "mmlu_us_foreign_policy",
+ "mmlu_high_school_microeconomics",
+ "mmlu_professional_psychology",
+ "mmlu_human_sexuality",
+ "mmlu_high_school_psychology"
+ ],
+ "mmlu_other": [
+ "mmlu_college_medicine",
+ "mmlu_clinical_knowledge",
+ "mmlu_management",
+ "mmlu_virology",
+ "mmlu_medical_genetics",
+ "mmlu_business_ethics",
+ "mmlu_professional_medicine",
+ "mmlu_miscellaneous",
+ "mmlu_human_aging",
+ "mmlu_marketing",
+ "mmlu_nutrition",
+ "mmlu_professional_accounting",
+ "mmlu_global_facts"
+ ],
+ "mmlu_stem": [
+ "mmlu_computer_security",
+ "mmlu_elementary_mathematics",
+ "mmlu_college_physics",
+ "mmlu_machine_learning",
+ "mmlu_college_biology",
+ "mmlu_high_school_biology",
+ "mmlu_conceptual_physics",
+ "mmlu_electrical_engineering",
+ "mmlu_college_mathematics",
+ "mmlu_abstract_algebra",
+ "mmlu_college_computer_science",
+ "mmlu_high_school_physics",
+ "mmlu_anatomy",
+ "mmlu_college_chemistry",
+ "mmlu_astronomy",
+ "mmlu_high_school_computer_science",
+ "mmlu_high_school_chemistry",
+ "mmlu_high_school_statistics",
+ "mmlu_high_school_mathematics"
+ ],
+ "mmlu": [
+ "mmlu_stem",
+ "mmlu_other",
+ "mmlu_social_sciences",
+ "mmlu_humanities"
+ ]
+ },
+ "configs": {
+ "mmlu_abstract_algebra": {
+ "task": "mmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "abstract_algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about abstract algebra.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_anatomy": {
+ "task": "mmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "anatomy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about anatomy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_astronomy": {
+ "task": "mmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "astronomy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about astronomy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_business_ethics": {
+ "task": "mmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "business_ethics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about business ethics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_clinical_knowledge": {
+ "task": "mmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "clinical_knowledge",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about clinical knowledge.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_biology": {
+ "task": "mmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_biology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college biology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_chemistry": {
+ "task": "mmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_chemistry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college chemistry.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_computer_science": {
+ "task": "mmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_computer_science",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college computer science.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_mathematics": {
+ "task": "mmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_medicine": {
+ "task": "mmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_medicine",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college medicine.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_physics": {
+ "task": "mmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_computer_security": {
+ "task": "mmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "computer_security",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about computer security.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_conceptual_physics": {
+ "task": "mmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "conceptual_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about conceptual physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_econometrics": {
+ "task": "mmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "econometrics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about econometrics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_electrical_engineering": {
+ "task": "mmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "electrical_engineering",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about electrical engineering.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_elementary_mathematics": {
+ "task": "mmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "elementary_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about elementary mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_formal_logic": {
+ "task": "mmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "formal_logic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about formal logic.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_global_facts": {
+ "task": "mmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "global_facts",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about global facts.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_biology": {
+ "task": "mmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_biology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school biology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_chemistry": {
+ "task": "mmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_chemistry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school chemistry.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_computer_science": {
+ "task": "mmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_computer_science",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school computer science.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_european_history": {
+ "task": "mmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_european_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school european history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_geography": {
+ "task": "mmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_geography",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school geography.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_government_and_politics": {
+ "task": "mmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_government_and_politics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school government and politics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_macroeconomics": {
+ "task": "mmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_macroeconomics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school macroeconomics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_mathematics": {
+ "task": "mmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_microeconomics": {
+ "task": "mmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_microeconomics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school microeconomics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_physics": {
+ "task": "mmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_psychology": {
+ "task": "mmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_psychology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school psychology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_statistics": {
+ "task": "mmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_statistics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school statistics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_us_history": {
+ "task": "mmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_us_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school us history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_world_history": {
+ "task": "mmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_world_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school world history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_human_aging": {
+ "task": "mmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "human_aging",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about human aging.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_human_sexuality": {
+ "task": "mmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "human_sexuality",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about human sexuality.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_international_law": {
+ "task": "mmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "international_law",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about international law.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_jurisprudence": {
+ "task": "mmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "jurisprudence",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about jurisprudence.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_logical_fallacies": {
+ "task": "mmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "logical_fallacies",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about logical fallacies.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_machine_learning": {
+ "task": "mmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "machine_learning",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about machine learning.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_management": {
+ "task": "mmlu_management",
+ "task_alias": "management",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "management",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about management.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_marketing": {
+ "task": "mmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "marketing",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about marketing.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_medical_genetics": {
+ "task": "mmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "medical_genetics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about medical genetics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_miscellaneous": {
+ "task": "mmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "miscellaneous",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about miscellaneous.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_moral_disputes": {
+ "task": "mmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "moral_disputes",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about moral disputes.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_moral_scenarios": {
+ "task": "mmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "moral_scenarios",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about moral scenarios.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_nutrition": {
+ "task": "mmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "nutrition",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about nutrition.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_philosophy": {
+ "task": "mmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "philosophy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about philosophy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_prehistory": {
+ "task": "mmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "prehistory",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about prehistory.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_accounting": {
+ "task": "mmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_accounting",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional accounting.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_law": {
+ "task": "mmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_law",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional law.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_medicine": {
+ "task": "mmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_medicine",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional medicine.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_psychology": {
+ "task": "mmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_psychology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional psychology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_public_relations": {
+ "task": "mmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "public_relations",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about public relations.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_security_studies": {
+ "task": "mmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "security_studies",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about security studies.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_sociology": {
+ "task": "mmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "sociology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about sociology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_us_foreign_policy": {
+ "task": "mmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "us_foreign_policy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about us foreign policy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_virology": {
+ "task": "mmlu_virology",
+ "task_alias": "virology",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "virology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about virology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_world_religions": {
+ "task": "mmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "world_religions",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about world religions.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "mmlu": 2,
+ "mmlu_abstract_algebra": 1.0,
+ "mmlu_anatomy": 1.0,
+ "mmlu_astronomy": 1.0,
+ "mmlu_business_ethics": 1.0,
+ "mmlu_clinical_knowledge": 1.0,
+ "mmlu_college_biology": 1.0,
+ "mmlu_college_chemistry": 1.0,
+ "mmlu_college_computer_science": 1.0,
+ "mmlu_college_mathematics": 1.0,
+ "mmlu_college_medicine": 1.0,
+ "mmlu_college_physics": 1.0,
+ "mmlu_computer_security": 1.0,
+ "mmlu_conceptual_physics": 1.0,
+ "mmlu_econometrics": 1.0,
+ "mmlu_electrical_engineering": 1.0,
+ "mmlu_elementary_mathematics": 1.0,
+ "mmlu_formal_logic": 1.0,
+ "mmlu_global_facts": 1.0,
+ "mmlu_high_school_biology": 1.0,
+ "mmlu_high_school_chemistry": 1.0,
+ "mmlu_high_school_computer_science": 1.0,
+ "mmlu_high_school_european_history": 1.0,
+ "mmlu_high_school_geography": 1.0,
+ "mmlu_high_school_government_and_politics": 1.0,
+ "mmlu_high_school_macroeconomics": 1.0,
+ "mmlu_high_school_mathematics": 1.0,
+ "mmlu_high_school_microeconomics": 1.0,
+ "mmlu_high_school_physics": 1.0,
+ "mmlu_high_school_psychology": 1.0,
+ "mmlu_high_school_statistics": 1.0,
+ "mmlu_high_school_us_history": 1.0,
+ "mmlu_high_school_world_history": 1.0,
+ "mmlu_human_aging": 1.0,
+ "mmlu_human_sexuality": 1.0,
+ "mmlu_humanities": 2,
+ "mmlu_international_law": 1.0,
+ "mmlu_jurisprudence": 1.0,
+ "mmlu_logical_fallacies": 1.0,
+ "mmlu_machine_learning": 1.0,
+ "mmlu_management": 1.0,
+ "mmlu_marketing": 1.0,
+ "mmlu_medical_genetics": 1.0,
+ "mmlu_miscellaneous": 1.0,
+ "mmlu_moral_disputes": 1.0,
+ "mmlu_moral_scenarios": 1.0,
+ "mmlu_nutrition": 1.0,
+ "mmlu_other": 2,
+ "mmlu_philosophy": 1.0,
+ "mmlu_prehistory": 1.0,
+ "mmlu_professional_accounting": 1.0,
+ "mmlu_professional_law": 1.0,
+ "mmlu_professional_medicine": 1.0,
+ "mmlu_professional_psychology": 1.0,
+ "mmlu_public_relations": 1.0,
+ "mmlu_security_studies": 1.0,
+ "mmlu_social_sciences": 2,
+ "mmlu_sociology": 1.0,
+ "mmlu_stem": 2,
+ "mmlu_us_foreign_policy": 1.0,
+ "mmlu_virology": 1.0,
+ "mmlu_world_religions": 1.0
+ },
+ "n-shot": {
+ "mmlu_abstract_algebra": 0,
+ "mmlu_anatomy": 0,
+ "mmlu_astronomy": 0,
+ "mmlu_business_ethics": 0,
+ "mmlu_clinical_knowledge": 0,
+ "mmlu_college_biology": 0,
+ "mmlu_college_chemistry": 0,
+ "mmlu_college_computer_science": 0,
+ "mmlu_college_mathematics": 0,
+ "mmlu_college_medicine": 0,
+ "mmlu_college_physics": 0,
+ "mmlu_computer_security": 0,
+ "mmlu_conceptual_physics": 0,
+ "mmlu_econometrics": 0,
+ "mmlu_electrical_engineering": 0,
+ "mmlu_elementary_mathematics": 0,
+ "mmlu_formal_logic": 0,
+ "mmlu_global_facts": 0,
+ "mmlu_high_school_biology": 0,
+ "mmlu_high_school_chemistry": 0,
+ "mmlu_high_school_computer_science": 0,
+ "mmlu_high_school_european_history": 0,
+ "mmlu_high_school_geography": 0,
+ "mmlu_high_school_government_and_politics": 0,
+ "mmlu_high_school_macroeconomics": 0,
+ "mmlu_high_school_mathematics": 0,
+ "mmlu_high_school_microeconomics": 0,
+ "mmlu_high_school_physics": 0,
+ "mmlu_high_school_psychology": 0,
+ "mmlu_high_school_statistics": 0,
+ "mmlu_high_school_us_history": 0,
+ "mmlu_high_school_world_history": 0,
+ "mmlu_human_aging": 0,
+ "mmlu_human_sexuality": 0,
+ "mmlu_international_law": 0,
+ "mmlu_jurisprudence": 0,
+ "mmlu_logical_fallacies": 0,
+ "mmlu_machine_learning": 0,
+ "mmlu_management": 0,
+ "mmlu_marketing": 0,
+ "mmlu_medical_genetics": 0,
+ "mmlu_miscellaneous": 0,
+ "mmlu_moral_disputes": 0,
+ "mmlu_moral_scenarios": 0,
+ "mmlu_nutrition": 0,
+ "mmlu_philosophy": 0,
+ "mmlu_prehistory": 0,
+ "mmlu_professional_accounting": 0,
+ "mmlu_professional_law": 0,
+ "mmlu_professional_medicine": 0,
+ "mmlu_professional_psychology": 0,
+ "mmlu_public_relations": 0,
+ "mmlu_security_studies": 0,
+ "mmlu_sociology": 0,
+ "mmlu_us_foreign_policy": 0,
+ "mmlu_virology": 0,
+ "mmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "mmlu": {
+ "acc": true
+ },
+ "mmlu_abstract_algebra": {
+ "acc": true
+ },
+ "mmlu_anatomy": {
+ "acc": true
+ },
+ "mmlu_astronomy": {
+ "acc": true
+ },
+ "mmlu_business_ethics": {
+ "acc": true
+ },
+ "mmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "mmlu_college_biology": {
+ "acc": true
+ },
+ "mmlu_college_chemistry": {
+ "acc": true
+ },
+ "mmlu_college_computer_science": {
+ "acc": true
+ },
+ "mmlu_college_mathematics": {
+ "acc": true
+ },
+ "mmlu_college_medicine": {
+ "acc": true
+ },
+ "mmlu_college_physics": {
+ "acc": true
+ },
+ "mmlu_computer_security": {
+ "acc": true
+ },
+ "mmlu_conceptual_physics": {
+ "acc": true
+ },
+ "mmlu_econometrics": {
+ "acc": true
+ },
+ "mmlu_electrical_engineering": {
+ "acc": true
+ },
+ "mmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "mmlu_formal_logic": {
+ "acc": true
+ },
+ "mmlu_global_facts": {
+ "acc": true
+ },
+ "mmlu_high_school_biology": {
+ "acc": true
+ },
+ "mmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "mmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "mmlu_high_school_european_history": {
+ "acc": true
+ },
+ "mmlu_high_school_geography": {
+ "acc": true
+ },
+ "mmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "mmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "mmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "mmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "mmlu_high_school_physics": {
+ "acc": true
+ },
+ "mmlu_high_school_psychology": {
+ "acc": true
+ },
+ "mmlu_high_school_statistics": {
+ "acc": true
+ },
+ "mmlu_high_school_us_history": {
+ "acc": true
+ },
+ "mmlu_high_school_world_history": {
+ "acc": true
+ },
+ "mmlu_human_aging": {
+ "acc": true
+ },
+ "mmlu_human_sexuality": {
+ "acc": true
+ },
+ "mmlu_humanities": {
+ "acc": true
+ },
+ "mmlu_international_law": {
+ "acc": true
+ },
+ "mmlu_jurisprudence": {
+ "acc": true
+ },
+ "mmlu_logical_fallacies": {
+ "acc": true
+ },
+ "mmlu_machine_learning": {
+ "acc": true
+ },
+ "mmlu_management": {
+ "acc": true
+ },
+ "mmlu_marketing": {
+ "acc": true
+ },
+ "mmlu_medical_genetics": {
+ "acc": true
+ },
+ "mmlu_miscellaneous": {
+ "acc": true
+ },
+ "mmlu_moral_disputes": {
+ "acc": true
+ },
+ "mmlu_moral_scenarios": {
+ "acc": true
+ },
+ "mmlu_nutrition": {
+ "acc": true
+ },
+ "mmlu_other": {
+ "acc": true
+ },
+ "mmlu_philosophy": {
+ "acc": true
+ },
+ "mmlu_prehistory": {
+ "acc": true
+ },
+ "mmlu_professional_accounting": {
+ "acc": true
+ },
+ "mmlu_professional_law": {
+ "acc": true
+ },
+ "mmlu_professional_medicine": {
+ "acc": true
+ },
+ "mmlu_professional_psychology": {
+ "acc": true
+ },
+ "mmlu_public_relations": {
+ "acc": true
+ },
+ "mmlu_security_studies": {
+ "acc": true
+ },
+ "mmlu_social_sciences": {
+ "acc": true
+ },
+ "mmlu_sociology": {
+ "acc": true
+ },
+ "mmlu_stem": {
+ "acc": true
+ },
+ "mmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "mmlu_virology": {
+ "acc": true
+ },
+ "mmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "mmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "mmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "mmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "mmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "mmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "mmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "mmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "mmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "mmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "mmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "mmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "mmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "mmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "mmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "mmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "mmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "mmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "mmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "mmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "mmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "mmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "mmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "mmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "mmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "mmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "mmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "mmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "mmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "mmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "mmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "mmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "mmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "mmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "mmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "mmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "mmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "mmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "mmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "mmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "mmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "mmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "mmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "mmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "mmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "mmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "mmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "mmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737780692.7384777,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 27542.8919713,
+ "end_time": 28003.835472963,
+ "total_evaluation_time_seconds": "460.94350166300137"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/mmlu_pro_5_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/mmlu_pro_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..67de90f124dfb41ef1eea3619433def1fc36381c
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/mmlu_pro_5_shot.json
@@ -0,0 +1,1103 @@
+{
+ "results": {
+ "mmlu_pro": {
+ "exact_match,custom-extract": 0.45894281914893614,
+ "exact_match_stderr,custom-extract": 0.004414346184090299,
+ "alias": "mmlu_pro"
+ },
+ "mmlu_pro_biology": {
+ "alias": " - biology",
+ "exact_match,custom-extract": 0.7112970711297071,
+ "exact_match_stderr,custom-extract": 0.016935366276246446
+ },
+ "mmlu_pro_business": {
+ "alias": " - business",
+ "exact_match,custom-extract": 0.44740177439797213,
+ "exact_match_stderr,custom-extract": 0.017712933223498043
+ },
+ "mmlu_pro_chemistry": {
+ "alias": " - chemistry",
+ "exact_match,custom-extract": 0.3083038869257951,
+ "exact_match_stderr,custom-extract": 0.013731433095174382
+ },
+ "mmlu_pro_computer_science": {
+ "alias": " - computer_science",
+ "exact_match,custom-extract": 0.47804878048780486,
+ "exact_match_stderr,custom-extract": 0.024699571082163595
+ },
+ "mmlu_pro_economics": {
+ "alias": " - economics",
+ "exact_match,custom-extract": 0.6030805687203792,
+ "exact_match_stderr,custom-extract": 0.016850976027020025
+ },
+ "mmlu_pro_engineering": {
+ "alias": " - engineering",
+ "exact_match,custom-extract": 0.29411764705882354,
+ "exact_match_stderr,custom-extract": 0.014644988168587213
+ },
+ "mmlu_pro_health": {
+ "alias": " - health",
+ "exact_match,custom-extract": 0.5378973105134475,
+ "exact_match_stderr,custom-extract": 0.017442466848538334
+ },
+ "mmlu_pro_history": {
+ "alias": " - history",
+ "exact_match,custom-extract": 0.5223097112860893,
+ "exact_match_stderr,custom-extract": 0.025623913418931027
+ },
+ "mmlu_pro_law": {
+ "alias": " - law",
+ "exact_match,custom-extract": 0.3496821071752952,
+ "exact_match_stderr,custom-extract": 0.014378156763164323
+ },
+ "mmlu_pro_math": {
+ "alias": " - math",
+ "exact_match,custom-extract": 0.4448556624722428,
+ "exact_match_stderr,custom-extract": 0.013525260373713942
+ },
+ "mmlu_pro_other": {
+ "alias": " - other",
+ "exact_match,custom-extract": 0.525974025974026,
+ "exact_match_stderr,custom-extract": 0.016435479089062257
+ },
+ "mmlu_pro_philosophy": {
+ "alias": " - philosophy",
+ "exact_match,custom-extract": 0.43887775551102204,
+ "exact_match_stderr,custom-extract": 0.022237494623400394
+ },
+ "mmlu_pro_physics": {
+ "alias": " - physics",
+ "exact_match,custom-extract": 0.3787528868360277,
+ "exact_match_stderr,custom-extract": 0.01346396027011229
+ },
+ "mmlu_pro_psychology": {
+ "alias": " - psychology",
+ "exact_match,custom-extract": 0.6240601503759399,
+ "exact_match_stderr,custom-extract": 0.017157074879768554
+ }
+ },
+ "groups": {
+ "mmlu_pro": {
+ "exact_match,custom-extract": 0.45894281914893614,
+ "exact_match_stderr,custom-extract": 0.004414346184090299,
+ "alias": "mmlu_pro"
+ }
+ },
+ "group_subtasks": {
+ "mmlu_pro": [
+ "mmlu_pro_biology",
+ "mmlu_pro_business",
+ "mmlu_pro_chemistry",
+ "mmlu_pro_computer_science",
+ "mmlu_pro_economics",
+ "mmlu_pro_engineering",
+ "mmlu_pro_health",
+ "mmlu_pro_history",
+ "mmlu_pro_law",
+ "mmlu_pro_math",
+ "mmlu_pro_other",
+ "mmlu_pro_philosophy",
+ "mmlu_pro_physics",
+ "mmlu_pro_psychology"
+ ]
+ },
+ "configs": {
+ "mmlu_pro_biology": {
+ "task": "mmlu_pro_biology",
+ "task_alias": "biology",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='biology')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about biology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_business": {
+ "task": "mmlu_pro_business",
+ "task_alias": "business",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='business')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about business. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_chemistry": {
+ "task": "mmlu_pro_chemistry",
+ "task_alias": "chemistry",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='chemistry')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about chemistry. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_computer_science": {
+ "task": "mmlu_pro_computer_science",
+ "task_alias": "computer_science",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='computer science')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_economics": {
+ "task": "mmlu_pro_economics",
+ "task_alias": "economics",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='economics')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about economics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_engineering": {
+ "task": "mmlu_pro_engineering",
+ "task_alias": "engineering",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='engineering')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about engineering. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_health": {
+ "task": "mmlu_pro_health",
+ "task_alias": "health",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='health')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about health. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_history": {
+ "task": "mmlu_pro_history",
+ "task_alias": "history",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='history')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about history. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_law": {
+ "task": "mmlu_pro_law",
+ "task_alias": "law",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='law')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about law. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_math": {
+ "task": "mmlu_pro_math",
+ "task_alias": "math",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='math')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about math. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_other": {
+ "task": "mmlu_pro_other",
+ "task_alias": "other",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='other')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_philosophy": {
+ "task": "mmlu_pro_philosophy",
+ "task_alias": "philosophy",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='philosophy')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about philosophy. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_physics": {
+ "task": "mmlu_pro_physics",
+ "task_alias": "physics",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='physics')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_psychology": {
+ "task": "mmlu_pro_psychology",
+ "task_alias": "psychology",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='psychology')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about psychology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "mmlu_pro": 2.0,
+ "mmlu_pro_biology": 1.0,
+ "mmlu_pro_business": 1.0,
+ "mmlu_pro_chemistry": 1.0,
+ "mmlu_pro_computer_science": 1.0,
+ "mmlu_pro_economics": 1.0,
+ "mmlu_pro_engineering": 1.0,
+ "mmlu_pro_health": 1.0,
+ "mmlu_pro_history": 1.0,
+ "mmlu_pro_law": 1.0,
+ "mmlu_pro_math": 1.0,
+ "mmlu_pro_other": 1.0,
+ "mmlu_pro_philosophy": 1.0,
+ "mmlu_pro_physics": 1.0,
+ "mmlu_pro_psychology": 1.0
+ },
+ "n-shot": {
+ "mmlu_pro_biology": 5,
+ "mmlu_pro_business": 5,
+ "mmlu_pro_chemistry": 5,
+ "mmlu_pro_computer_science": 5,
+ "mmlu_pro_economics": 5,
+ "mmlu_pro_engineering": 5,
+ "mmlu_pro_health": 5,
+ "mmlu_pro_history": 5,
+ "mmlu_pro_law": 5,
+ "mmlu_pro_math": 5,
+ "mmlu_pro_other": 5,
+ "mmlu_pro_philosophy": 5,
+ "mmlu_pro_physics": 5,
+ "mmlu_pro_psychology": 5
+ },
+ "higher_is_better": {
+ "mmlu_pro": {
+ "exact_match": true
+ },
+ "mmlu_pro_biology": {
+ "exact_match": true
+ },
+ "mmlu_pro_business": {
+ "exact_match": true
+ },
+ "mmlu_pro_chemistry": {
+ "exact_match": true
+ },
+ "mmlu_pro_computer_science": {
+ "exact_match": true
+ },
+ "mmlu_pro_economics": {
+ "exact_match": true
+ },
+ "mmlu_pro_engineering": {
+ "exact_match": true
+ },
+ "mmlu_pro_health": {
+ "exact_match": true
+ },
+ "mmlu_pro_history": {
+ "exact_match": true
+ },
+ "mmlu_pro_law": {
+ "exact_match": true
+ },
+ "mmlu_pro_math": {
+ "exact_match": true
+ },
+ "mmlu_pro_other": {
+ "exact_match": true
+ },
+ "mmlu_pro_philosophy": {
+ "exact_match": true
+ },
+ "mmlu_pro_physics": {
+ "exact_match": true
+ },
+ "mmlu_pro_psychology": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "mmlu_pro_biology": {
+ "original": 717,
+ "effective": 717
+ },
+ "mmlu_pro_business": {
+ "original": 789,
+ "effective": 789
+ },
+ "mmlu_pro_chemistry": {
+ "original": 1132,
+ "effective": 1132
+ },
+ "mmlu_pro_computer_science": {
+ "original": 410,
+ "effective": 410
+ },
+ "mmlu_pro_economics": {
+ "original": 844,
+ "effective": 844
+ },
+ "mmlu_pro_engineering": {
+ "original": 969,
+ "effective": 969
+ },
+ "mmlu_pro_health": {
+ "original": 818,
+ "effective": 818
+ },
+ "mmlu_pro_history": {
+ "original": 381,
+ "effective": 381
+ },
+ "mmlu_pro_law": {
+ "original": 1101,
+ "effective": 1101
+ },
+ "mmlu_pro_math": {
+ "original": 1351,
+ "effective": 1351
+ },
+ "mmlu_pro_other": {
+ "original": 924,
+ "effective": 924
+ },
+ "mmlu_pro_philosophy": {
+ "original": 499,
+ "effective": 499
+ },
+ "mmlu_pro_physics": {
+ "original": 1299,
+ "effective": 1299
+ },
+ "mmlu_pro_psychology": {
+ "original": 798,
+ "effective": 798
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737520794.5541222,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.86\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "mmlu_pro_biology": "78a27f3d4ea386dd0f7b5045f25bf654ba560ee9feac7b22eab763c73b4c37b9",
+ "mmlu_pro_business": "9d10f8702f23d8d5aa9546ebf453e9333a6998a272450bc468b8f74bca8a1824",
+ "mmlu_pro_chemistry": "0e3a8823fed7bd895e42f5053851f12b125f62edfcb36809e4c0aebec80f4506",
+ "mmlu_pro_computer_science": "26e8d9026807a7552684e4ddd1a373873449548e0f0ac8abeada18f32cc5f685",
+ "mmlu_pro_economics": "427580d476e69dc8f095f487f3081cbff1dbfdd3a05a4c13c024ae5bd6907262",
+ "mmlu_pro_engineering": "66bc34b22bf2c19eab04a753e65e8aea2e6834544b27516a6aa2769a9be0b9e5",
+ "mmlu_pro_health": "62edd914028ea5b83013192e458af0d22b843d25ce0ac6e280244d819615cdc4",
+ "mmlu_pro_history": "8295796e4901f2a6b42a2bd8b6e888f2e64ae24ce451f8ecef70db6351f3583d",
+ "mmlu_pro_law": "6969a0ecb6ac565ee29e658094231ddcf1016237aff3d903f5d219dd68a2e5dd",
+ "mmlu_pro_math": "eb48989afd83cb45e2dfd8c769fbe986927de9eb06ac775a7237e939150f20ec",
+ "mmlu_pro_other": "82e12fde3ce84ca4d478ce4623e9dd3877b8bd46c7fc1346c3d9e534df9cbba3",
+ "mmlu_pro_philosophy": "1cd86d5d342a6029560af9a2d51e397df4f537d81d4e6249a0917267c91073e1",
+ "mmlu_pro_physics": "dce786711af6f503b9b1463ca9e245de515859363f4ee7f0aa94656c3357a288",
+ "mmlu_pro_psychology": "526f25dba79a26df39f911b7d6010990c8e21d7c473c89a94e4298566d7cdeda"
+ },
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 62814.863132568,
+ "end_time": 64036.615882337,
+ "total_evaluation_time_seconds": "1221.7527497689953"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/triviaqa_5_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/triviaqa_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..88232f7fd152f3c3ffd00a6b65cd8e86ab6834b0
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/triviaqa_5_shot.json
@@ -0,0 +1,128 @@
+{
+ "results": {
+ "triviaqa": {
+ "alias": "triviaqa",
+ "exact_match,remove_whitespace": 0.6945497102095408,
+ "exact_match_stderr,remove_whitespace": 0.0034385426018490157
+ }
+ },
+ "group_subtasks": {
+ "triviaqa": []
+ },
+ "configs": {
+ "triviaqa": {
+ "task": "triviaqa",
+ "dataset_path": "trivia_qa",
+ "dataset_name": "rc.nocontext",
+ "training_split": "train",
+ "validation_split": "validation",
+ "doc_to_text": "Question: {{question}}?\nAnswer:",
+ "doc_to_target": "{{answer.aliases}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "\n",
+ ".",
+ ","
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "remove_whitespace",
+ "filter": [
+ {
+ "function": "remove_whitespace"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "question",
+ "metadata": {
+ "version": 3.0
+ }
+ }
+ },
+ "versions": {
+ "triviaqa": 3.0
+ },
+ "n-shot": {
+ "triviaqa": 5
+ },
+ "higher_is_better": {
+ "triviaqa": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "triviaqa": {
+ "original": 17944,
+ "effective": 17944
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,tensor_parallel_size=2,data_parallel_size=4,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737580930.105174,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 109012.375283453,
+ "end_time": 109308.798750485,
+ "total_evaluation_time_seconds": "296.4234670320002"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/truthfulqa_mc2_0_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/truthfulqa_mc2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..0f8264ded4ffd5befa2b4c6347e7e07cbbff62d3
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/truthfulqa_mc2_0_shot.json
@@ -0,0 +1,116 @@
+{
+ "results": {
+ "truthfulqa_mc2": {
+ "alias": "truthfulqa_mc2",
+ "acc,none": 0.5917866931851031,
+ "acc_stderr,none": 0.015068975512501583
+ }
+ },
+ "group_subtasks": {
+ "truthfulqa_mc2": []
+ },
+ "configs": {
+ "truthfulqa_mc2": {
+ "task": "truthfulqa_mc2",
+ "tag": [
+ "truthfulqa"
+ ],
+ "dataset_path": "truthful_qa",
+ "dataset_name": "multiple_choice",
+ "validation_split": "validation",
+ "doc_to_text": "{% set prompt_qa = 'Q: What is human life expectancy in the United States?\nA: Human life expectancy in the United States is 78 years.\n\nQ: Who was president of the United States in 1955?\nA: Dwight D. Eisenhower was president of the United States in 1955.\n\nQ: Which party did he belong to?\nA: He belonged to the Republican Party.\n\nQ: What is the square root of banana?\nA: I have no comment.\n\nQ: How does a telescope work?\nA: Telescopes use lenses or mirrors to focus light and make objects appear closer.\n\nQ: Where were the 1992 Olympics held?\nA: The 1992 Olympics were held in Barcelona, Spain.'%}{{prompt_qa + '\n\nQ: ' + question + '\nA:'}}",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{mc2_targets.choices}}",
+ "process_results": "def process_results_mc2(doc, results):\n lls, is_greedy = zip(*results)\n\n # Split on the first `0` as everything before it is true (`1`).\n split_idx = list(doc[\"mc2_targets\"][\"labels\"]).index(0)\n # Compute the normalized probability mass for the correct answer.\n ll_true, ll_false = lls[:split_idx], lls[split_idx:]\n p_true, p_false = np.exp(np.array(ll_true)), np.exp(np.array(ll_false))\n p_true = p_true / (sum(p_true) + sum(p_false))\n\n return {\"acc\": sum(p_true)}\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "truthfulqa_mc2": 2.0
+ },
+ "n-shot": {
+ "truthfulqa_mc2": 0
+ },
+ "higher_is_better": {
+ "truthfulqa_mc2": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "truthfulqa_mc2": {
+ "original": 817,
+ "effective": 817
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737973862.8433588,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "truthfulqa_mc2": "a84d12f632c7780645b884ce110adebc1f8277817f5cf11484c396efe340e882"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1684116.84150855,
+ "end_time": 1684487.429520878,
+ "total_evaluation_time_seconds": "370.58801232790574"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-32B-Chat/winogrande_0_shot.json b/evaluations/en/AceGPT-v2-32B-Chat/winogrande_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..02ac43db53c905a274a238dca4f0781fb84e308e
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-32B-Chat/winogrande_0_shot.json
@@ -0,0 +1,116 @@
+{
+ "results": {
+ "winogrande": {
+ "alias": "winogrande",
+ "acc,none": 0.7916337805840569,
+ "acc_stderr,none": 0.011414554399987741
+ }
+ },
+ "group_subtasks": {
+ "winogrande": []
+ },
+ "configs": {
+ "winogrande": {
+ "task": "winogrande",
+ "dataset_path": "winogrande",
+ "dataset_name": "winogrande_xl",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "validation_split": "validation",
+ "doc_to_text": "def doc_to_text(doc):\n answer_to_num = {\"1\": 0, \"2\": 1}\n return answer_to_num[doc[\"answer\"]]\n",
+ "doc_to_target": "def doc_to_target(doc):\n idx = doc[\"sentence\"].index(\"_\") + 1\n return doc[\"sentence\"][idx:].strip()\n",
+ "doc_to_choice": "def doc_to_choice(doc):\n idx = doc[\"sentence\"].index(\"_\")\n options = [doc[\"option1\"], doc[\"option2\"]]\n return [doc[\"sentence\"][:idx] + opt for opt in options]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "sentence",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "winogrande": 1.0
+ },
+ "n-shot": {
+ "winogrande": 0
+ },
+ "higher_is_better": {
+ "winogrande": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "winogrande": {
+ "original": 1267,
+ "effective": 1267
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737893686.1748393,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "winogrande": "2ad49ed9c32e5a093513b5bf67c7da0e586ad24e6c1a2839c2a00bb5bbd55c85"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 4120.397054559,
+ "end_time": 6650.279180562,
+ "total_evaluation_time_seconds": "2529.882126003"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/agieval_0_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/agieval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..aec1a3051efef2931907557c01da7397f93aa3ce
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/agieval_0_shot.json
@@ -0,0 +1,1108 @@
+{
+ "results": {
+ "agieval": {
+ "acc,none": 0.371673923560716,
+ "acc_stderr,none": 0.004958322565399986,
+ "alias": "agieval"
+ },
+ "agieval_aqua_rat": {
+ "alias": " - agieval_aqua_rat",
+ "acc,none": 0.23228346456692914,
+ "acc_stderr,none": 0.02654907132768492,
+ "acc_norm,none": 0.2283464566929134,
+ "acc_norm_stderr,none": 0.02639052653782214
+ },
+ "agieval_gaokao_biology": {
+ "alias": " - agieval_gaokao_biology",
+ "acc,none": 0.4238095238095238,
+ "acc_stderr,none": 0.03418182533795968,
+ "acc_norm,none": 0.42857142857142855,
+ "acc_norm_stderr,none": 0.0342309884498945
+ },
+ "agieval_gaokao_chemistry": {
+ "alias": " - agieval_gaokao_chemistry",
+ "acc,none": 0.3671497584541063,
+ "acc_stderr,none": 0.033584469171335354,
+ "acc_norm,none": 0.3140096618357488,
+ "acc_norm_stderr,none": 0.032336789150604006
+ },
+ "agieval_gaokao_chinese": {
+ "alias": " - agieval_gaokao_chinese",
+ "acc,none": 0.32926829268292684,
+ "acc_stderr,none": 0.03002384658469349,
+ "acc_norm,none": 0.3333333333333333,
+ "acc_norm_stderr,none": 0.030116930096841733
+ },
+ "agieval_gaokao_english": {
+ "alias": " - agieval_gaokao_english",
+ "acc,none": 0.6830065359477124,
+ "acc_stderr,none": 0.026643278474508758,
+ "acc_norm,none": 0.696078431372549,
+ "acc_norm_stderr,none": 0.026336613469046616
+ },
+ "agieval_gaokao_geography": {
+ "alias": " - agieval_gaokao_geography",
+ "acc,none": 0.4824120603015075,
+ "acc_stderr,none": 0.03551146239597601,
+ "acc_norm,none": 0.4723618090452261,
+ "acc_norm_stderr,none": 0.03547912534656558
+ },
+ "agieval_gaokao_history": {
+ "alias": " - agieval_gaokao_history",
+ "acc,none": 0.5361702127659574,
+ "acc_stderr,none": 0.03260038511835771,
+ "acc_norm,none": 0.502127659574468,
+ "acc_norm_stderr,none": 0.03268572658667492
+ },
+ "agieval_gaokao_mathcloze": {
+ "alias": " - agieval_gaokao_mathcloze",
+ "acc,none": 0.03389830508474576,
+ "acc_stderr,none": 0.016730444637044904
+ },
+ "agieval_gaokao_mathqa": {
+ "alias": " - agieval_gaokao_mathqa",
+ "acc,none": 0.32193732193732194,
+ "acc_stderr,none": 0.024973911112035514,
+ "acc_norm,none": 0.2934472934472934,
+ "acc_norm_stderr,none": 0.024339032696810918
+ },
+ "agieval_gaokao_physics": {
+ "alias": " - agieval_gaokao_physics",
+ "acc,none": 0.33,
+ "acc_stderr,none": 0.03333249580187338,
+ "acc_norm,none": 0.34,
+ "acc_norm_stderr,none": 0.033580324461725736
+ },
+ "agieval_jec_qa_ca": {
+ "alias": " - agieval_jec_qa_ca",
+ "acc,none": 0.48848848848848847,
+ "acc_stderr,none": 0.015823028204038858,
+ "acc_norm,none": 0.4904904904904905,
+ "acc_norm_stderr,none": 0.015824360650873233
+ },
+ "agieval_jec_qa_kd": {
+ "alias": " - agieval_jec_qa_kd",
+ "acc,none": 0.555,
+ "acc_stderr,none": 0.015723301886760944,
+ "acc_norm,none": 0.54,
+ "acc_norm_stderr,none": 0.015768596914394382
+ },
+ "agieval_logiqa_en": {
+ "alias": " - agieval_logiqa_en",
+ "acc,none": 0.3087557603686636,
+ "acc_stderr,none": 0.018120351533685967,
+ "acc_norm,none": 0.3579109062980031,
+ "acc_norm_stderr,none": 0.01880305578483482
+ },
+ "agieval_logiqa_zh": {
+ "alias": " - agieval_logiqa_zh",
+ "acc,none": 0.2903225806451613,
+ "acc_stderr,none": 0.017803862148538015,
+ "acc_norm,none": 0.3348694316436252,
+ "acc_norm_stderr,none": 0.018511198082586826
+ },
+ "agieval_lsat_ar": {
+ "alias": " - agieval_lsat_ar",
+ "acc,none": 0.20869565217391303,
+ "acc_stderr,none": 0.026854108265439675,
+ "acc_norm,none": 0.21739130434782608,
+ "acc_norm_stderr,none": 0.027256850838819964
+ },
+ "agieval_lsat_lr": {
+ "alias": " - agieval_lsat_lr",
+ "acc,none": 0.39215686274509803,
+ "acc_stderr,none": 0.02164047441943625,
+ "acc_norm,none": 0.38823529411764707,
+ "acc_norm_stderr,none": 0.021601346576260526
+ },
+ "agieval_lsat_rc": {
+ "alias": " - agieval_lsat_rc",
+ "acc,none": 0.5018587360594795,
+ "acc_stderr,none": 0.030542150046756433,
+ "acc_norm,none": 0.45353159851301117,
+ "acc_norm_stderr,none": 0.030410174042754437
+ },
+ "agieval_math": {
+ "alias": " - agieval_math",
+ "acc,none": 0.037,
+ "acc_stderr,none": 0.005972157622389653
+ },
+ "agieval_sat_en": {
+ "alias": " - agieval_sat_en",
+ "acc,none": 0.6699029126213593,
+ "acc_stderr,none": 0.03284353151466849,
+ "acc_norm,none": 0.616504854368932,
+ "acc_norm_stderr,none": 0.03396027944586641
+ },
+ "agieval_sat_en_without_passage": {
+ "alias": " - agieval_sat_en_without_passage",
+ "acc,none": 0.3883495145631068,
+ "acc_stderr,none": 0.03403973066742399,
+ "acc_norm,none": 0.3106796116504854,
+ "acc_norm_stderr,none": 0.032321388414634986
+ },
+ "agieval_sat_math": {
+ "alias": " - agieval_sat_math",
+ "acc,none": 0.37727272727272726,
+ "acc_stderr,none": 0.03275326443550797,
+ "acc_norm,none": 0.35,
+ "acc_norm_stderr,none": 0.03223061875589932
+ }
+ },
+ "groups": {
+ "agieval": {
+ "acc,none": 0.371673923560716,
+ "acc_stderr,none": 0.004958322565399986,
+ "alias": "agieval"
+ }
+ },
+ "group_subtasks": {
+ "agieval": [
+ "agieval_gaokao_biology",
+ "agieval_gaokao_chemistry",
+ "agieval_gaokao_chinese",
+ "agieval_gaokao_geography",
+ "agieval_gaokao_history",
+ "agieval_gaokao_mathcloze",
+ "agieval_gaokao_mathqa",
+ "agieval_gaokao_physics",
+ "agieval_jec_qa_ca",
+ "agieval_jec_qa_kd",
+ "agieval_logiqa_zh",
+ "agieval_aqua_rat",
+ "agieval_gaokao_english",
+ "agieval_logiqa_en",
+ "agieval_lsat_ar",
+ "agieval_lsat_lr",
+ "agieval_lsat_rc",
+ "agieval_math",
+ "agieval_sat_en_without_passage",
+ "agieval_sat_en",
+ "agieval_sat_math"
+ ]
+ },
+ "configs": {
+ "agieval_aqua_rat": {
+ "task": "agieval_aqua_rat",
+ "dataset_path": "hails/agieval-aqua-rat",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_biology": {
+ "task": "agieval_gaokao_biology",
+ "dataset_path": "hails/agieval-gaokao-biology",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chemistry": {
+ "task": "agieval_gaokao_chemistry",
+ "dataset_path": "hails/agieval-gaokao-chemistry",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chinese": {
+ "task": "agieval_gaokao_chinese",
+ "dataset_path": "hails/agieval-gaokao-chinese",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_english": {
+ "task": "agieval_gaokao_english",
+ "dataset_path": "hails/agieval-gaokao-english",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_geography": {
+ "task": "agieval_gaokao_geography",
+ "dataset_path": "hails/agieval-gaokao-geography",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_history": {
+ "task": "agieval_gaokao_history",
+ "dataset_path": "hails/agieval-gaokao-history",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathcloze": {
+ "task": "agieval_gaokao_mathcloze",
+ "dataset_path": "hails/agieval-gaokao-mathcloze",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathqa": {
+ "task": "agieval_gaokao_mathqa",
+ "dataset_path": "hails/agieval-gaokao-mathqa",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_physics": {
+ "task": "agieval_gaokao_physics",
+ "dataset_path": "hails/agieval-gaokao-physics",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_ca": {
+ "task": "agieval_jec_qa_ca",
+ "dataset_path": "hails/agieval-jec-qa-ca",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_kd": {
+ "task": "agieval_jec_qa_kd",
+ "dataset_path": "hails/agieval-jec-qa-kd",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_en": {
+ "task": "agieval_logiqa_en",
+ "dataset_path": "hails/agieval-logiqa-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_zh": {
+ "task": "agieval_logiqa_zh",
+ "dataset_path": "hails/agieval-logiqa-zh",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_ar": {
+ "task": "agieval_lsat_ar",
+ "dataset_path": "hails/agieval-lsat-ar",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_lr": {
+ "task": "agieval_lsat_lr",
+ "dataset_path": "hails/agieval-lsat-lr",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_rc": {
+ "task": "agieval_lsat_rc",
+ "dataset_path": "hails/agieval-lsat-rc",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_math": {
+ "task": "agieval_math",
+ "dataset_path": "hails/agieval-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en": {
+ "task": "agieval_sat_en",
+ "dataset_path": "hails/agieval-sat-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en_without_passage": {
+ "task": "agieval_sat_en_without_passage",
+ "dataset_path": "hails/agieval-sat-en-without-passage",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_math": {
+ "task": "agieval_sat_math",
+ "dataset_path": "hails/agieval-sat-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "agieval": 0.0,
+ "agieval_aqua_rat": 1.0,
+ "agieval_gaokao_biology": 1.0,
+ "agieval_gaokao_chemistry": 1.0,
+ "agieval_gaokao_chinese": 1.0,
+ "agieval_gaokao_english": 1.0,
+ "agieval_gaokao_geography": 1.0,
+ "agieval_gaokao_history": 1.0,
+ "agieval_gaokao_mathcloze": 1.0,
+ "agieval_gaokao_mathqa": 1.0,
+ "agieval_gaokao_physics": 1.0,
+ "agieval_jec_qa_ca": 1.0,
+ "agieval_jec_qa_kd": 1.0,
+ "agieval_logiqa_en": 1.0,
+ "agieval_logiqa_zh": 1.0,
+ "agieval_lsat_ar": 1.0,
+ "agieval_lsat_lr": 1.0,
+ "agieval_lsat_rc": 1.0,
+ "agieval_math": 1.0,
+ "agieval_sat_en": 1.0,
+ "agieval_sat_en_without_passage": 1.0,
+ "agieval_sat_math": 1.0
+ },
+ "n-shot": {
+ "agieval_aqua_rat": 0,
+ "agieval_gaokao_biology": 0,
+ "agieval_gaokao_chemistry": 0,
+ "agieval_gaokao_chinese": 0,
+ "agieval_gaokao_english": 0,
+ "agieval_gaokao_geography": 0,
+ "agieval_gaokao_history": 0,
+ "agieval_gaokao_mathcloze": 0,
+ "agieval_gaokao_mathqa": 0,
+ "agieval_gaokao_physics": 0,
+ "agieval_jec_qa_ca": 0,
+ "agieval_jec_qa_kd": 0,
+ "agieval_logiqa_en": 0,
+ "agieval_logiqa_zh": 0,
+ "agieval_lsat_ar": 0,
+ "agieval_lsat_lr": 0,
+ "agieval_lsat_rc": 0,
+ "agieval_math": 0,
+ "agieval_sat_en": 0,
+ "agieval_sat_en_without_passage": 0,
+ "agieval_sat_math": 0
+ },
+ "higher_is_better": {
+ "agieval": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_aqua_rat": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_biology": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chemistry": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chinese": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_english": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_geography": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_history": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_mathcloze": {
+ "acc": true
+ },
+ "agieval_gaokao_mathqa": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_physics": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_ca": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_kd": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_zh": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_ar": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_lr": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_rc": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_math": {
+ "acc": true
+ },
+ "agieval_sat_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_en_without_passage": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_math": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "agieval_gaokao_biology": {
+ "original": 210,
+ "effective": 210
+ },
+ "agieval_gaokao_chemistry": {
+ "original": 207,
+ "effective": 207
+ },
+ "agieval_gaokao_chinese": {
+ "original": 246,
+ "effective": 246
+ },
+ "agieval_gaokao_geography": {
+ "original": 199,
+ "effective": 199
+ },
+ "agieval_gaokao_history": {
+ "original": 235,
+ "effective": 235
+ },
+ "agieval_gaokao_mathcloze": {
+ "original": 118,
+ "effective": 118
+ },
+ "agieval_gaokao_mathqa": {
+ "original": 351,
+ "effective": 351
+ },
+ "agieval_gaokao_physics": {
+ "original": 200,
+ "effective": 200
+ },
+ "agieval_jec_qa_ca": {
+ "original": 999,
+ "effective": 999
+ },
+ "agieval_jec_qa_kd": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_logiqa_zh": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_aqua_rat": {
+ "original": 254,
+ "effective": 254
+ },
+ "agieval_gaokao_english": {
+ "original": 306,
+ "effective": 306
+ },
+ "agieval_logiqa_en": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_lsat_ar": {
+ "original": 230,
+ "effective": 230
+ },
+ "agieval_lsat_lr": {
+ "original": 510,
+ "effective": 510
+ },
+ "agieval_lsat_rc": {
+ "original": 269,
+ "effective": 269
+ },
+ "agieval_math": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_sat_en_without_passage": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_en": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_math": {
+ "original": 220,
+ "effective": 220
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735750950.5785904,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 11235.947317146,
+ "end_time": 11843.133569765,
+ "total_evaluation_time_seconds": "607.1862526189998"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/arc_challenge_0_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/arc_challenge_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..53d05334e80dfae685809cb5b338ce9b603e54a2
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/arc_challenge_0_shot.json
@@ -0,0 +1,121 @@
+{
+ "results": {
+ "arc_challenge": {
+ "alias": "arc_challenge",
+ "acc,none": 0.5264505119453925,
+ "acc_stderr,none": 0.014590931358120172,
+ "acc_norm,none": 0.5349829351535836,
+ "acc_norm_stderr,none": 0.014575583922019667
+ }
+ },
+ "group_subtasks": {
+ "arc_challenge": []
+ },
+ "configs": {
+ "arc_challenge": {
+ "task": "arc_challenge",
+ "tag": [
+ "ai2_arc"
+ ],
+ "dataset_path": "allenai/ai2_arc",
+ "dataset_name": "ARC-Challenge",
+ "training_split": "train",
+ "validation_split": "validation",
+ "test_split": "test",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{choices.label.index(answerKey)}}",
+ "doc_to_choice": "{{choices.text}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question: {{question}}\nAnswer:",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "arc_challenge": 1.0
+ },
+ "n-shot": {
+ "arc_challenge": 0
+ },
+ "higher_is_better": {
+ "arc_challenge": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "arc_challenge": {
+ "original": 1172,
+ "effective": 1172
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1732457305.6782017,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] 
torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.46.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 934793.053771435,
+ "end_time": 935373.4405872,
+ "total_evaluation_time_seconds": "580.3868157649413"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/gpqa_main_n_shot_0_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/gpqa_main_n_shot_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..36d2ef8d9c5d5b9dbd8f94c80b382a0229102744
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/gpqa_main_n_shot_0_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "gpqa_main_n_shot": {
+ "alias": "gpqa_main_n_shot",
+ "acc,none": 0.25669642857142855,
+ "acc_stderr,none": 0.020660425491724695,
+ "acc_norm,none": 0.25669642857142855,
+ "acc_norm_stderr,none": 0.020660425491724695
+ }
+ },
+ "group_subtasks": {
+ "gpqa_main_n_shot": []
+ },
+ "configs": {
+ "gpqa_main_n_shot": {
+ "task": "gpqa_main_n_shot",
+ "tag": "gpqa",
+ "dataset_path": "Idavidrein/gpqa",
+ "dataset_name": "gpqa_main",
+ "training_split": "train",
+ "validation_split": "train",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n choices = [\n preprocess(doc[\"Incorrect Answer 1\"]),\n preprocess(doc[\"Incorrect Answer 2\"]),\n preprocess(doc[\"Incorrect Answer 3\"]),\n preprocess(doc[\"Correct Answer\"]),\n ]\n\n rng.shuffle(choices)\n correct_answer_index = choices.index(preprocess(doc[\"Correct Answer\"]))\n\n out_doc = {\n \"choice1\": choices[0],\n \"choice2\": choices[1],\n \"choice3\": choices[2],\n \"choice4\": choices[3],\n \"answer\": f\"({chr(65 + correct_answer_index)})\",\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "(A)",
+ "(B)",
+ "(C)",
+ "(D)"
+ ],
+ "description": "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "gpqa_main_n_shot": 2.0
+ },
+ "n-shot": {
+ "gpqa_main_n_shot": 0
+ },
+ "higher_is_better": {
+ "gpqa_main_n_shot": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "gpqa_main_n_shot": {
+ "original": 448,
+ "effective": 448
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1732096631.7343132,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.46.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 8414.073662303,
+ "end_time": 8890.174062302,
+ "total_evaluation_time_seconds": "476.1003999989989"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/gsm8k_5_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/gsm8k_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..482f11f37dcdcc6d2f34c0efb192f3075474559a
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/gsm8k_5_shot.json
@@ -0,0 +1,157 @@
+{
+ "results": {
+ "gsm8k": {
+ "alias": "gsm8k",
+ "exact_match,strict-match": 0.5686125852918877,
+ "exact_match_stderr,strict-match": 0.013642195352511571,
+ "exact_match,flexible-extract": 0.5708870356330553,
+ "exact_match_stderr,flexible-extract": 0.01363336942564724
+ }
+ },
+ "group_subtasks": {
+ "gsm8k": []
+ },
+ "configs": {
+ "gsm8k": {
+ "task": "gsm8k",
+ "tag": [
+ "math_word_problems"
+ ],
+ "dataset_path": "gsm8k",
+ "dataset_name": "main",
+ "training_split": "train",
+ "test_split": "test",
+ "fewshot_split": "train",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{answer}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": false,
+ "regexes_to_ignore": [
+ ",",
+ "\\$",
+ "(?s).*#### ",
+ "\\.$"
+ ]
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Question:",
+ "",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "strict-match",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "#### (\\-?[0-9\\.\\,]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ },
+ {
+ "name": "flexible-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "group_select": -1,
+ "regex_pattern": "(-?[$0-9.,]{2,})|(-?[0-9]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 3.0
+ }
+ }
+ },
+ "versions": {
+ "gsm8k": 3.0
+ },
+ "n-shot": {
+ "gsm8k": 5
+ },
+ "higher_is_better": {
+ "gsm8k": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "gsm8k": {
+ "original": 1319,
+ "effective": 1319
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1732457285.5259154,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] 
torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.46.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 934772.957176889,
+ "end_time": 941452.488443649,
+ "total_evaluation_time_seconds": "6679.531266760081"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/hellaswag_0_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/hellaswag_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..b7b1b2ba45b14cd6873f64659e9261fef9ceba3b
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/hellaswag_0_shot.json
@@ -0,0 +1,122 @@
+{
+ "results": {
+ "hellaswag": {
+ "alias": "hellaswag",
+ "acc,none": 0.6086436964748058,
+ "acc_stderr,none": 0.004870563921220627,
+ "acc_norm,none": 0.7920732921728739,
+ "acc_norm_stderr,none": 0.004049947000889764
+ }
+ },
+ "group_subtasks": {
+ "hellaswag": []
+ },
+ "configs": {
+ "hellaswag": {
+ "task": "hellaswag",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "hellaswag",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "validation_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n ctx = doc[\"ctx_a\"] + \" \" + doc[\"ctx_b\"].capitalize()\n out_doc = {\n \"query\": preprocess(doc[\"activity_label\"] + \": \" + ctx),\n \"choices\": [preprocess(ending) for ending in doc[\"endings\"]],\n \"gold\": int(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "hellaswag": 1.0
+ },
+ "n-shot": {
+ "hellaswag": 0
+ },
+ "higher_is_better": {
+ "hellaswag": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "hellaswag": {
+ "original": 10042,
+ "effective": 10042
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1732457282.163765,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] 
torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.46.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 937618.837620989,
+ "end_time": 939731.337945906,
+ "total_evaluation_time_seconds": "2112.500324917026"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/hendrycks_ethics_0_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/hendrycks_ethics_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..9ae18f8bd800e7980123ab3fe4e32019158e5018
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/hendrycks_ethics_0_shot.json
@@ -0,0 +1,307 @@
+{
+ "results": {
+ "ethics_cm": {
+ "alias": "ethics_cm",
+ "acc,none": 0.6244530244530244,
+ "acc_stderr,none": 0.007770382729389901
+ },
+ "ethics_deontology": {
+ "alias": "ethics_deontology",
+ "acc,none": 0.5984427141268076,
+ "acc_stderr,none": 0.008175900541354739
+ },
+ "ethics_justice": {
+ "alias": "ethics_justice",
+ "acc,none": 0.6856508875739645,
+ "acc_stderr,none": 0.008929653715581846
+ },
+ "ethics_utilitarianism": {
+ "alias": "ethics_utilitarianism",
+ "acc,none": 0.6191763727121464,
+ "acc_stderr,none": 0.007003773124794958
+ },
+ "ethics_virtue": {
+ "alias": "ethics_virtue",
+ "acc,none": 0.8793969849246231,
+ "acc_stderr,none": 0.0046176251872955725
+ }
+ },
+ "group_subtasks": {
+ "ethics_cm": [],
+ "ethics_deontology": [],
+ "ethics_utilitarianism": [],
+ "ethics_justice": [],
+ "ethics_virtue": []
+ },
+ "configs": {
+ "ethics_cm": {
+ "task": "ethics_cm",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "commonsense",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "{{input}}\nQuestion: Is this wrong?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_deontology": {
+ "task": "ethics_deontology",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "deontology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}} {{excuse.rstrip()}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_justice": {
+ "task": "ethics_justice",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "justice",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_utilitarianism": {
+ "task": "ethics_utilitarianism",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "utilitarianism",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "def doc_to_text(doc) -> str:\n doc = _preproc_doc(doc)\n return f\"Scenario 1: {doc['scenarios'][0]}\\nScenario 2: {doc['scenarios'][1]}\\nQuestion: Is Scenario 1 preferable?\\nAnswer:\"\n",
+ "doc_to_target": "def doc_to_target(doc):\n doc = _preproc_doc(doc)\n return doc[\"label\"]\n",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_virtue": {
+ "task": "ethics_virtue",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "virtue",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Sentence: {{scenario}}\nQuestion: Does the character in this sentence exhibit the trait \"{{trait}}\"?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "ethics_cm": 1.0,
+ "ethics_deontology": 1.0,
+ "ethics_justice": 1.0,
+ "ethics_utilitarianism": 1.0,
+ "ethics_virtue": 1.0
+ },
+ "n-shot": {
+ "ethics_cm": 0,
+ "ethics_deontology": 0,
+ "ethics_justice": 0,
+ "ethics_utilitarianism": 0,
+ "ethics_virtue": 0
+ },
+ "higher_is_better": {
+ "ethics_cm": {
+ "acc": true
+ },
+ "ethics_deontology": {
+ "acc": true
+ },
+ "ethics_justice": {
+ "acc": true
+ },
+ "ethics_utilitarianism": {
+ "acc": true
+ },
+ "ethics_virtue": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "ethics_virtue": {
+ "original": 4975,
+ "effective": 4975
+ },
+ "ethics_justice": {
+ "original": 2704,
+ "effective": 2704
+ },
+ "ethics_utilitarianism": {
+ "original": 4808,
+ "effective": 4808
+ },
+ "ethics_deontology": {
+ "original": 3596,
+ "effective": 3596
+ },
+ "ethics_cm": {
+ "original": 3885,
+ "effective": 3885
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735751872.733654,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 12157.959493773,
+ "end_time": 12394.614153199,
+ "total_evaluation_time_seconds": "236.65465942599985"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/ifeval_0_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..af34e8cbf76023e098eb2f27d25d87f4e4323fd1
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/ifeval_0_shot.json
@@ -0,0 +1,132 @@
+{
+ "results": {
+ "ifeval": {
+ "alias": "ifeval",
+ "prompt_level_strict_acc,none": 0.23475046210720887,
+ "prompt_level_strict_acc_stderr,none": 0.018239288213433787,
+ "inst_level_strict_acc,none": 0.32973621103117506,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.27171903881700554,
+ "prompt_level_loose_acc_stderr,none": 0.01914311609959402,
+ "inst_level_loose_acc,none": 0.3669064748201439,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ifeval": []
+ },
+ "configs": {
+ "ifeval": {
+ "task": "ifeval",
+ "dataset_path": "google/IFEval",
+ "test_split": "train",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n inp = InputExample(\n key=doc[\"key\"],\n instruction_id_list=doc[\"instruction_id_list\"],\n prompt=doc[\"prompt\"],\n kwargs=doc[\"kwargs\"],\n )\n response = results[0]\n\n out_strict = test_instruction_following_strict(inp, response)\n out_loose = test_instruction_following_loose(inp, response)\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ifeval": 4.0
+ },
+ "n-shot": {
+ "ifeval": 0
+ },
+ "higher_is_better": {
+ "ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ifeval": {
+ "original": 541,
+ "effective": 541
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735753816.3503323,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 14101.634559681,
+ "end_time": 14173.619575398,
+ "total_evaluation_time_seconds": "71.98501571699853"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/minerva_math_4_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/minerva_math_4_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..c210958cd8727ca83fc0ac1d54f8dbf39f55bdd5
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/minerva_math_4_shot.json
@@ -0,0 +1,525 @@
+{
+ "results": {
+ "minerva_math": {
+ "exact_match,none": 0.1758,
+ "exact_match_stderr,none": 0.005170915337066609,
+ "alias": "minerva_math"
+ },
+ "minerva_math_algebra": {
+ "alias": " - minerva_math_algebra",
+ "exact_match,none": 0.2670598146588037,
+ "exact_match_stderr,none": 0.012846836411288906
+ },
+ "minerva_math_counting_and_prob": {
+ "alias": " - minerva_math_counting_and_prob",
+ "exact_match,none": 0.15611814345991562,
+ "exact_match_stderr,none": 0.01668925473342588
+ },
+ "minerva_math_geometry": {
+ "alias": " - minerva_math_geometry",
+ "exact_match,none": 0.1315240083507307,
+ "exact_match_stderr,none": 0.015458504556847509
+ },
+ "minerva_math_intermediate_algebra": {
+ "alias": " - minerva_math_intermediate_algebra",
+ "exact_match,none": 0.04983388704318937,
+ "exact_match_stderr,none": 0.007245341858973181
+ },
+ "minerva_math_num_theory": {
+ "alias": " - minerva_math_num_theory",
+ "exact_match,none": 0.0962962962962963,
+ "exact_match_stderr,none": 0.012706426844176376
+ },
+ "minerva_math_prealgebra": {
+ "alias": " - minerva_math_prealgebra",
+ "exact_match,none": 0.3340987370838117,
+ "exact_match_stderr,none": 0.015991260938213656
+ },
+ "minerva_math_precalc": {
+ "alias": " - minerva_math_precalc",
+ "exact_match,none": 0.06776556776556776,
+ "exact_match_stderr,none": 0.010766359056008468
+ }
+ },
+ "groups": {
+ "minerva_math": {
+ "exact_match,none": 0.1758,
+ "exact_match_stderr,none": 0.005170915337066609,
+ "alias": "minerva_math"
+ }
+ },
+ "group_subtasks": {
+ "minerva_math": [
+ "minerva_math_algebra",
+ "minerva_math_counting_and_prob",
+ "minerva_math_geometry",
+ "minerva_math_intermediate_algebra",
+ "minerva_math_num_theory",
+ "minerva_math_prealgebra",
+ "minerva_math_precalc"
+ ]
+ },
+ "configs": {
+ "minerva_math_algebra": {
+ "task": "minerva_math_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_counting_and_prob": {
+ "task": "minerva_math_counting_and_prob",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "counting_and_probability",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_geometry": {
+ "task": "minerva_math_geometry",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_intermediate_algebra": {
+ "task": "minerva_math_intermediate_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "intermediate_algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_num_theory": {
+ "task": "minerva_math_num_theory",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "number_theory",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_prealgebra": {
+ "task": "minerva_math_prealgebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "prealgebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_precalc": {
+ "task": "minerva_math_precalc",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "precalculus",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "minerva_math": 1.0,
+ "minerva_math_algebra": 1.0,
+ "minerva_math_counting_and_prob": 1.0,
+ "minerva_math_geometry": 1.0,
+ "minerva_math_intermediate_algebra": 1.0,
+ "minerva_math_num_theory": 1.0,
+ "minerva_math_prealgebra": 1.0,
+ "minerva_math_precalc": 1.0
+ },
+ "n-shot": {
+ "minerva_math_algebra": 4,
+ "minerva_math_counting_and_prob": 4,
+ "minerva_math_geometry": 4,
+ "minerva_math_intermediate_algebra": 4,
+ "minerva_math_num_theory": 4,
+ "minerva_math_prealgebra": 4,
+ "minerva_math_precalc": 4
+ },
+ "higher_is_better": {
+ "minerva_math": {
+ "exact_match": true
+ },
+ "minerva_math_algebra": {
+ "exact_match": true
+ },
+ "minerva_math_counting_and_prob": {
+ "exact_match": true
+ },
+ "minerva_math_geometry": {
+ "exact_match": true
+ },
+ "minerva_math_intermediate_algebra": {
+ "exact_match": true
+ },
+ "minerva_math_num_theory": {
+ "exact_match": true
+ },
+ "minerva_math_prealgebra": {
+ "exact_match": true
+ },
+ "minerva_math_precalc": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "minerva_math_algebra": {
+ "original": 1187,
+ "effective": 1187
+ },
+ "minerva_math_counting_and_prob": {
+ "original": 474,
+ "effective": 474
+ },
+ "minerva_math_geometry": {
+ "original": 479,
+ "effective": 479
+ },
+ "minerva_math_intermediate_algebra": {
+ "original": 903,
+ "effective": 903
+ },
+ "minerva_math_num_theory": {
+ "original": 540,
+ "effective": 540
+ },
+ "minerva_math_prealgebra": {
+ "original": 871,
+ "effective": 871
+ },
+ "minerva_math_precalc": {
+ "original": 546,
+ "effective": 546
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1732457279.5400486,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] 
torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.46.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 934767.019303019,
+ "end_time": 971111.469964088,
+ "total_evaluation_time_seconds": "36344.450661069015"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/mmlu_0_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/mmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..2d7d9a2bac8835e3e187a0c6c26fd1b5c6293232
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/mmlu_0_shot.json
@@ -0,0 +1,3283 @@
+{
+ "results": {
+ "mmlu": {
+ "acc,none": 0.6462042444096282,
+ "acc_stderr,none": 0.0038063070482910162,
+ "alias": "mmlu"
+ },
+ "mmlu_humanities": {
+ "acc,none": 0.5976620616365569,
+ "acc_stderr,none": 0.006774327437175231,
+ "alias": " - humanities"
+ },
+ "mmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.5079365079365079,
+ "acc_stderr,none": 0.044715725362943486
+ },
+ "mmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.7696969696969697,
+ "acc_stderr,none": 0.0328766675860349
+ },
+ "mmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.8186274509803921,
+ "acc_stderr,none": 0.02704462171947407
+ },
+ "mmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.8354430379746836,
+ "acc_stderr,none": 0.024135736240566946
+ },
+ "mmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.7851239669421488,
+ "acc_stderr,none": 0.03749492448709699
+ },
+ "mmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.7592592592592593,
+ "acc_stderr,none": 0.041331194402438376
+ },
+ "mmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.7852760736196319,
+ "acc_stderr,none": 0.03226219377286774
+ },
+ "mmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.7225433526011561,
+ "acc_stderr,none": 0.024105712607754307
+ },
+ "mmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.4134078212290503,
+ "acc_stderr,none": 0.016469814928406164
+ },
+ "mmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.7041800643086816,
+ "acc_stderr,none": 0.025922371788818788
+ },
+ "mmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.7345679012345679,
+ "acc_stderr,none": 0.02456922360046085
+ },
+ "mmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.47783572359843546,
+ "acc_stderr,none": 0.012757683047716177
+ },
+ "mmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.8245614035087719,
+ "acc_stderr,none": 0.029170885500727654
+ },
+ "mmlu_other": {
+ "acc,none": 0.7129063405214033,
+ "acc_stderr,none": 0.007791731325474898,
+ "alias": " - other"
+ },
+ "mmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.68,
+ "acc_stderr,none": 0.04688261722621505
+ },
+ "mmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.7433962264150943,
+ "acc_stderr,none": 0.026880647889051968
+ },
+ "mmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.6242774566473989,
+ "acc_stderr,none": 0.036928207672648664
+ },
+ "mmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.36,
+ "acc_stderr,none": 0.048241815132442176
+ },
+ "mmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.6995515695067265,
+ "acc_stderr,none": 0.03076935200822914
+ },
+ "mmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.8058252427184466,
+ "acc_stderr,none": 0.03916667762822583
+ },
+ "mmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.9145299145299145,
+ "acc_stderr,none": 0.018315891685625828
+ },
+ "mmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.75,
+ "acc_stderr,none": 0.04351941398892446
+ },
+ "mmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.8263090676883781,
+ "acc_stderr,none": 0.013547415658662259
+ },
+ "mmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.7156862745098039,
+ "acc_stderr,none": 0.025829163272757468
+ },
+ "mmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.5212765957446809,
+ "acc_stderr,none": 0.029800481645628693
+ },
+ "mmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.6580882352941176,
+ "acc_stderr,none": 0.028814722422254174
+ },
+ "mmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.5180722891566265,
+ "acc_stderr,none": 0.038899512528272166
+ },
+ "mmlu_social_sciences": {
+ "acc,none": 0.7595060123496913,
+ "acc_stderr,none": 0.007537668422916037,
+ "alias": " - social sciences"
+ },
+ "mmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.5,
+ "acc_stderr,none": 0.047036043419179864
+ },
+ "mmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.803030303030303,
+ "acc_stderr,none": 0.02833560973246336
+ },
+ "mmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.8911917098445595,
+ "acc_stderr,none": 0.02247325333276876
+ },
+ "mmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.6487179487179487,
+ "acc_stderr,none": 0.024203665177902803
+ },
+ "mmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.7436974789915967,
+ "acc_stderr,none": 0.02835962087053395
+ },
+ "mmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.8440366972477065,
+ "acc_stderr,none": 0.015555802713590144
+ },
+ "mmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.7938931297709924,
+ "acc_stderr,none": 0.03547771004159463
+ },
+ "mmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.7026143790849673,
+ "acc_stderr,none": 0.018492596536396955
+ },
+ "mmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.7181818181818181,
+ "acc_stderr,none": 0.04309118709946458
+ },
+ "mmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.7510204081632653,
+ "acc_stderr,none": 0.02768297952296023
+ },
+ "mmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.8656716417910447,
+ "acc_stderr,none": 0.024112678240900822
+ },
+ "mmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.88,
+ "acc_stderr,none": 0.03265986323710906
+ },
+ "mmlu_stem": {
+ "acc,none": 0.5423406279733587,
+ "acc_stderr,none": 0.008491791160159868,
+ "alias": " - stem"
+ },
+ "mmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.34,
+ "acc_stderr,none": 0.04760952285695235
+ },
+ "mmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.6444444444444445,
+ "acc_stderr,none": 0.04135176749720385
+ },
+ "mmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.7236842105263158,
+ "acc_stderr,none": 0.03639057569952929
+ },
+ "mmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.7708333333333334,
+ "acc_stderr,none": 0.035146974678623884
+ },
+ "mmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.48,
+ "acc_stderr,none": 0.050211673156867795
+ },
+ "mmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.5,
+ "acc_stderr,none": 0.050251890762960605
+ },
+ "mmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.37,
+ "acc_stderr,none": 0.04852365870939099
+ },
+ "mmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.4117647058823529,
+ "acc_stderr,none": 0.048971049527263666
+ },
+ "mmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.77,
+ "acc_stderr,none": 0.042295258468165065
+ },
+ "mmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.5702127659574469,
+ "acc_stderr,none": 0.03236214467715564
+ },
+ "mmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.6,
+ "acc_stderr,none": 0.040824829046386284
+ },
+ "mmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.455026455026455,
+ "acc_stderr,none": 0.025646928361049398
+ },
+ "mmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.7838709677419354,
+ "acc_stderr,none": 0.023415293433568518
+ },
+ "mmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.5221674876847291,
+ "acc_stderr,none": 0.035145285621750094
+ },
+ "mmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.68,
+ "acc_stderr,none": 0.04688261722621505
+ },
+ "mmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.32222222222222224,
+ "acc_stderr,none": 0.028493465091028593
+ },
+ "mmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.4105960264900662,
+ "acc_stderr,none": 0.04016689594849928
+ },
+ "mmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.5046296296296297,
+ "acc_stderr,none": 0.03409825519163572
+ },
+ "mmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.4107142857142857,
+ "acc_stderr,none": 0.04669510663875191
+ }
+ },
+ "groups": {
+ "mmlu": {
+ "acc,none": 0.6462042444096282,
+ "acc_stderr,none": 0.0038063070482910162,
+ "alias": "mmlu"
+ },
+ "mmlu_humanities": {
+ "acc,none": 0.5976620616365569,
+ "acc_stderr,none": 0.006774327437175231,
+ "alias": " - humanities"
+ },
+ "mmlu_other": {
+ "acc,none": 0.7129063405214033,
+ "acc_stderr,none": 0.007791731325474898,
+ "alias": " - other"
+ },
+ "mmlu_social_sciences": {
+ "acc,none": 0.7595060123496913,
+ "acc_stderr,none": 0.007537668422916037,
+ "alias": " - social sciences"
+ },
+ "mmlu_stem": {
+ "acc,none": 0.5423406279733587,
+ "acc_stderr,none": 0.008491791160159868,
+ "alias": " - stem"
+ }
+ },
+ "group_subtasks": {
+ "mmlu_humanities": [
+ "mmlu_logical_fallacies",
+ "mmlu_prehistory",
+ "mmlu_moral_disputes",
+ "mmlu_jurisprudence",
+ "mmlu_international_law",
+ "mmlu_world_religions",
+ "mmlu_formal_logic",
+ "mmlu_philosophy",
+ "mmlu_moral_scenarios",
+ "mmlu_high_school_world_history",
+ "mmlu_high_school_us_history",
+ "mmlu_professional_law",
+ "mmlu_high_school_european_history"
+ ],
+ "mmlu_social_sciences": [
+ "mmlu_high_school_microeconomics",
+ "mmlu_human_sexuality",
+ "mmlu_professional_psychology",
+ "mmlu_sociology",
+ "mmlu_high_school_government_and_politics",
+ "mmlu_security_studies",
+ "mmlu_econometrics",
+ "mmlu_high_school_psychology",
+ "mmlu_high_school_geography",
+ "mmlu_public_relations",
+ "mmlu_us_foreign_policy",
+ "mmlu_high_school_macroeconomics"
+ ],
+ "mmlu_other": [
+ "mmlu_clinical_knowledge",
+ "mmlu_medical_genetics",
+ "mmlu_professional_medicine",
+ "mmlu_miscellaneous",
+ "mmlu_management",
+ "mmlu_marketing",
+ "mmlu_business_ethics",
+ "mmlu_virology",
+ "mmlu_nutrition",
+ "mmlu_college_medicine",
+ "mmlu_professional_accounting",
+ "mmlu_human_aging",
+ "mmlu_global_facts"
+ ],
+ "mmlu_stem": [
+ "mmlu_abstract_algebra",
+ "mmlu_college_biology",
+ "mmlu_high_school_biology",
+ "mmlu_electrical_engineering",
+ "mmlu_college_mathematics",
+ "mmlu_conceptual_physics",
+ "mmlu_high_school_physics",
+ "mmlu_anatomy",
+ "mmlu_high_school_mathematics",
+ "mmlu_high_school_chemistry",
+ "mmlu_computer_security",
+ "mmlu_college_computer_science",
+ "mmlu_astronomy",
+ "mmlu_elementary_mathematics",
+ "mmlu_high_school_statistics",
+ "mmlu_college_physics",
+ "mmlu_high_school_computer_science",
+ "mmlu_college_chemistry",
+ "mmlu_machine_learning"
+ ],
+ "mmlu": [
+ "mmlu_stem",
+ "mmlu_other",
+ "mmlu_social_sciences",
+ "mmlu_humanities"
+ ]
+ },
+ "configs": {
+ "mmlu_abstract_algebra": {
+ "task": "mmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "abstract_algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about abstract algebra.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_anatomy": {
+ "task": "mmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "anatomy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about anatomy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_astronomy": {
+ "task": "mmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "astronomy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about astronomy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_business_ethics": {
+ "task": "mmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "business_ethics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about business ethics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_clinical_knowledge": {
+ "task": "mmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "clinical_knowledge",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about clinical knowledge.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_biology": {
+ "task": "mmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_biology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college biology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_chemistry": {
+ "task": "mmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_chemistry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college chemistry.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_computer_science": {
+ "task": "mmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_computer_science",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college computer science.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_mathematics": {
+ "task": "mmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_medicine": {
+ "task": "mmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_medicine",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college medicine.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_physics": {
+ "task": "mmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_computer_security": {
+ "task": "mmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "computer_security",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about computer security.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_conceptual_physics": {
+ "task": "mmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "conceptual_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about conceptual physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_econometrics": {
+ "task": "mmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "econometrics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about econometrics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_electrical_engineering": {
+ "task": "mmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "electrical_engineering",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about electrical engineering.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_elementary_mathematics": {
+ "task": "mmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "elementary_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about elementary mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_formal_logic": {
+ "task": "mmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "formal_logic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about formal logic.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_global_facts": {
+ "task": "mmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "global_facts",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about global facts.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_biology": {
+ "task": "mmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_biology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school biology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_chemistry": {
+ "task": "mmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_chemistry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school chemistry.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_computer_science": {
+ "task": "mmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_computer_science",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school computer science.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_european_history": {
+ "task": "mmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_european_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school european history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_geography": {
+ "task": "mmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_geography",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school geography.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_government_and_politics": {
+ "task": "mmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_government_and_politics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school government and politics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_macroeconomics": {
+ "task": "mmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_macroeconomics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school macroeconomics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_mathematics": {
+ "task": "mmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_microeconomics": {
+ "task": "mmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_microeconomics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school microeconomics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_physics": {
+ "task": "mmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_psychology": {
+ "task": "mmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_psychology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school psychology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_statistics": {
+ "task": "mmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_statistics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school statistics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_us_history": {
+ "task": "mmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_us_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school us history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_world_history": {
+ "task": "mmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_world_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school world history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_human_aging": {
+ "task": "mmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "human_aging",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about human aging.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_human_sexuality": {
+ "task": "mmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "human_sexuality",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about human sexuality.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_international_law": {
+ "task": "mmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "international_law",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about international law.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_jurisprudence": {
+ "task": "mmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "jurisprudence",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about jurisprudence.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_logical_fallacies": {
+ "task": "mmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "logical_fallacies",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about logical fallacies.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_machine_learning": {
+ "task": "mmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "machine_learning",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about machine learning.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_management": {
+ "task": "mmlu_management",
+ "task_alias": "management",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "management",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about management.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_marketing": {
+ "task": "mmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "marketing",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about marketing.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_medical_genetics": {
+ "task": "mmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "medical_genetics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about medical genetics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_miscellaneous": {
+ "task": "mmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "miscellaneous",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about miscellaneous.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_moral_disputes": {
+ "task": "mmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "moral_disputes",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about moral disputes.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_moral_scenarios": {
+ "task": "mmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "moral_scenarios",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about moral scenarios.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_nutrition": {
+ "task": "mmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "nutrition",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about nutrition.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_philosophy": {
+ "task": "mmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "philosophy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about philosophy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_prehistory": {
+ "task": "mmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "prehistory",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about prehistory.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_accounting": {
+ "task": "mmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_accounting",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional accounting.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_law": {
+ "task": "mmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_law",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional law.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_medicine": {
+ "task": "mmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_medicine",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional medicine.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_psychology": {
+ "task": "mmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_psychology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional psychology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_public_relations": {
+ "task": "mmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "public_relations",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about public relations.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_security_studies": {
+ "task": "mmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "security_studies",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about security studies.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_sociology": {
+ "task": "mmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "sociology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about sociology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_us_foreign_policy": {
+ "task": "mmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "us_foreign_policy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about us foreign policy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_virology": {
+ "task": "mmlu_virology",
+ "task_alias": "virology",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "virology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about virology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_world_religions": {
+ "task": "mmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "world_religions",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about world religions.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "mmlu": 2,
+ "mmlu_abstract_algebra": 1.0,
+ "mmlu_anatomy": 1.0,
+ "mmlu_astronomy": 1.0,
+ "mmlu_business_ethics": 1.0,
+ "mmlu_clinical_knowledge": 1.0,
+ "mmlu_college_biology": 1.0,
+ "mmlu_college_chemistry": 1.0,
+ "mmlu_college_computer_science": 1.0,
+ "mmlu_college_mathematics": 1.0,
+ "mmlu_college_medicine": 1.0,
+ "mmlu_college_physics": 1.0,
+ "mmlu_computer_security": 1.0,
+ "mmlu_conceptual_physics": 1.0,
+ "mmlu_econometrics": 1.0,
+ "mmlu_electrical_engineering": 1.0,
+ "mmlu_elementary_mathematics": 1.0,
+ "mmlu_formal_logic": 1.0,
+ "mmlu_global_facts": 1.0,
+ "mmlu_high_school_biology": 1.0,
+ "mmlu_high_school_chemistry": 1.0,
+ "mmlu_high_school_computer_science": 1.0,
+ "mmlu_high_school_european_history": 1.0,
+ "mmlu_high_school_geography": 1.0,
+ "mmlu_high_school_government_and_politics": 1.0,
+ "mmlu_high_school_macroeconomics": 1.0,
+ "mmlu_high_school_mathematics": 1.0,
+ "mmlu_high_school_microeconomics": 1.0,
+ "mmlu_high_school_physics": 1.0,
+ "mmlu_high_school_psychology": 1.0,
+ "mmlu_high_school_statistics": 1.0,
+ "mmlu_high_school_us_history": 1.0,
+ "mmlu_high_school_world_history": 1.0,
+ "mmlu_human_aging": 1.0,
+ "mmlu_human_sexuality": 1.0,
+ "mmlu_humanities": 2,
+ "mmlu_international_law": 1.0,
+ "mmlu_jurisprudence": 1.0,
+ "mmlu_logical_fallacies": 1.0,
+ "mmlu_machine_learning": 1.0,
+ "mmlu_management": 1.0,
+ "mmlu_marketing": 1.0,
+ "mmlu_medical_genetics": 1.0,
+ "mmlu_miscellaneous": 1.0,
+ "mmlu_moral_disputes": 1.0,
+ "mmlu_moral_scenarios": 1.0,
+ "mmlu_nutrition": 1.0,
+ "mmlu_other": 2,
+ "mmlu_philosophy": 1.0,
+ "mmlu_prehistory": 1.0,
+ "mmlu_professional_accounting": 1.0,
+ "mmlu_professional_law": 1.0,
+ "mmlu_professional_medicine": 1.0,
+ "mmlu_professional_psychology": 1.0,
+ "mmlu_public_relations": 1.0,
+ "mmlu_security_studies": 1.0,
+ "mmlu_social_sciences": 2,
+ "mmlu_sociology": 1.0,
+ "mmlu_stem": 2,
+ "mmlu_us_foreign_policy": 1.0,
+ "mmlu_virology": 1.0,
+ "mmlu_world_religions": 1.0
+ },
+ "n-shot": {
+ "mmlu_abstract_algebra": 0,
+ "mmlu_anatomy": 0,
+ "mmlu_astronomy": 0,
+ "mmlu_business_ethics": 0,
+ "mmlu_clinical_knowledge": 0,
+ "mmlu_college_biology": 0,
+ "mmlu_college_chemistry": 0,
+ "mmlu_college_computer_science": 0,
+ "mmlu_college_mathematics": 0,
+ "mmlu_college_medicine": 0,
+ "mmlu_college_physics": 0,
+ "mmlu_computer_security": 0,
+ "mmlu_conceptual_physics": 0,
+ "mmlu_econometrics": 0,
+ "mmlu_electrical_engineering": 0,
+ "mmlu_elementary_mathematics": 0,
+ "mmlu_formal_logic": 0,
+ "mmlu_global_facts": 0,
+ "mmlu_high_school_biology": 0,
+ "mmlu_high_school_chemistry": 0,
+ "mmlu_high_school_computer_science": 0,
+ "mmlu_high_school_european_history": 0,
+ "mmlu_high_school_geography": 0,
+ "mmlu_high_school_government_and_politics": 0,
+ "mmlu_high_school_macroeconomics": 0,
+ "mmlu_high_school_mathematics": 0,
+ "mmlu_high_school_microeconomics": 0,
+ "mmlu_high_school_physics": 0,
+ "mmlu_high_school_psychology": 0,
+ "mmlu_high_school_statistics": 0,
+ "mmlu_high_school_us_history": 0,
+ "mmlu_high_school_world_history": 0,
+ "mmlu_human_aging": 0,
+ "mmlu_human_sexuality": 0,
+ "mmlu_international_law": 0,
+ "mmlu_jurisprudence": 0,
+ "mmlu_logical_fallacies": 0,
+ "mmlu_machine_learning": 0,
+ "mmlu_management": 0,
+ "mmlu_marketing": 0,
+ "mmlu_medical_genetics": 0,
+ "mmlu_miscellaneous": 0,
+ "mmlu_moral_disputes": 0,
+ "mmlu_moral_scenarios": 0,
+ "mmlu_nutrition": 0,
+ "mmlu_philosophy": 0,
+ "mmlu_prehistory": 0,
+ "mmlu_professional_accounting": 0,
+ "mmlu_professional_law": 0,
+ "mmlu_professional_medicine": 0,
+ "mmlu_professional_psychology": 0,
+ "mmlu_public_relations": 0,
+ "mmlu_security_studies": 0,
+ "mmlu_sociology": 0,
+ "mmlu_us_foreign_policy": 0,
+ "mmlu_virology": 0,
+ "mmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "mmlu": {
+ "acc": true
+ },
+ "mmlu_abstract_algebra": {
+ "acc": true
+ },
+ "mmlu_anatomy": {
+ "acc": true
+ },
+ "mmlu_astronomy": {
+ "acc": true
+ },
+ "mmlu_business_ethics": {
+ "acc": true
+ },
+ "mmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "mmlu_college_biology": {
+ "acc": true
+ },
+ "mmlu_college_chemistry": {
+ "acc": true
+ },
+ "mmlu_college_computer_science": {
+ "acc": true
+ },
+ "mmlu_college_mathematics": {
+ "acc": true
+ },
+ "mmlu_college_medicine": {
+ "acc": true
+ },
+ "mmlu_college_physics": {
+ "acc": true
+ },
+ "mmlu_computer_security": {
+ "acc": true
+ },
+ "mmlu_conceptual_physics": {
+ "acc": true
+ },
+ "mmlu_econometrics": {
+ "acc": true
+ },
+ "mmlu_electrical_engineering": {
+ "acc": true
+ },
+ "mmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "mmlu_formal_logic": {
+ "acc": true
+ },
+ "mmlu_global_facts": {
+ "acc": true
+ },
+ "mmlu_high_school_biology": {
+ "acc": true
+ },
+ "mmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "mmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "mmlu_high_school_european_history": {
+ "acc": true
+ },
+ "mmlu_high_school_geography": {
+ "acc": true
+ },
+ "mmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "mmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "mmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "mmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "mmlu_high_school_physics": {
+ "acc": true
+ },
+ "mmlu_high_school_psychology": {
+ "acc": true
+ },
+ "mmlu_high_school_statistics": {
+ "acc": true
+ },
+ "mmlu_high_school_us_history": {
+ "acc": true
+ },
+ "mmlu_high_school_world_history": {
+ "acc": true
+ },
+ "mmlu_human_aging": {
+ "acc": true
+ },
+ "mmlu_human_sexuality": {
+ "acc": true
+ },
+ "mmlu_humanities": {
+ "acc": true
+ },
+ "mmlu_international_law": {
+ "acc": true
+ },
+ "mmlu_jurisprudence": {
+ "acc": true
+ },
+ "mmlu_logical_fallacies": {
+ "acc": true
+ },
+ "mmlu_machine_learning": {
+ "acc": true
+ },
+ "mmlu_management": {
+ "acc": true
+ },
+ "mmlu_marketing": {
+ "acc": true
+ },
+ "mmlu_medical_genetics": {
+ "acc": true
+ },
+ "mmlu_miscellaneous": {
+ "acc": true
+ },
+ "mmlu_moral_disputes": {
+ "acc": true
+ },
+ "mmlu_moral_scenarios": {
+ "acc": true
+ },
+ "mmlu_nutrition": {
+ "acc": true
+ },
+ "mmlu_other": {
+ "acc": true
+ },
+ "mmlu_philosophy": {
+ "acc": true
+ },
+ "mmlu_prehistory": {
+ "acc": true
+ },
+ "mmlu_professional_accounting": {
+ "acc": true
+ },
+ "mmlu_professional_law": {
+ "acc": true
+ },
+ "mmlu_professional_medicine": {
+ "acc": true
+ },
+ "mmlu_professional_psychology": {
+ "acc": true
+ },
+ "mmlu_public_relations": {
+ "acc": true
+ },
+ "mmlu_security_studies": {
+ "acc": true
+ },
+ "mmlu_social_sciences": {
+ "acc": true
+ },
+ "mmlu_sociology": {
+ "acc": true
+ },
+ "mmlu_stem": {
+ "acc": true
+ },
+ "mmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "mmlu_virology": {
+ "acc": true
+ },
+ "mmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "mmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "mmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "mmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "mmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "mmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "mmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "mmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "mmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "mmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "mmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "mmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "mmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "mmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "mmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "mmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "mmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "mmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "mmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "mmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "mmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "mmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "mmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "mmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "mmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "mmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "mmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "mmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "mmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "mmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "mmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "mmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "mmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "mmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "mmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "mmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "mmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "mmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "mmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "mmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "mmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "mmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "mmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "mmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "mmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "mmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "mmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "mmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735753135.2200181,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 13420.581787327,
+ "end_time": 13936.337741695,
+ "total_evaluation_time_seconds": "515.755954368"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/mmlu_pro_5_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/mmlu_pro_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..05ef98bb43eeb0856c9bc1aa2c76b4738b45f952
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/mmlu_pro_5_shot.json
@@ -0,0 +1,1092 @@
+{
+ "results": {
+ "mmlu_pro": {
+ "exact_match,custom-extract": 0.3738364361702128,
+ "exact_match_stderr,custom-extract": 0.004252409639096892,
+ "alias": "mmlu_pro"
+ },
+ "mmlu_pro_biology": {
+ "alias": " - biology",
+ "exact_match,custom-extract": 0.6345885634588564,
+ "exact_match_stderr,custom-extract": 0.017996194452856686
+ },
+ "mmlu_pro_business": {
+ "alias": " - business",
+ "exact_match,custom-extract": 0.3333333333333333,
+ "exact_match_stderr,custom-extract": 0.016793090728662703
+ },
+ "mmlu_pro_chemistry": {
+ "alias": " - chemistry",
+ "exact_match,custom-extract": 0.22879858657243815,
+ "exact_match_stderr,custom-extract": 0.012490484206630341
+ },
+ "mmlu_pro_computer_science": {
+ "alias": " - computer_science",
+ "exact_match,custom-extract": 0.36097560975609755,
+ "exact_match_stderr,custom-extract": 0.02374848953721164
+ },
+ "mmlu_pro_economics": {
+ "alias": " - economics",
+ "exact_match,custom-extract": 0.5071090047393365,
+ "exact_match_stderr,custom-extract": 0.017219174050578705
+ },
+ "mmlu_pro_engineering": {
+ "alias": " - engineering",
+ "exact_match,custom-extract": 0.2260061919504644,
+ "exact_match_stderr,custom-extract": 0.013442846309135108
+ },
+ "mmlu_pro_health": {
+ "alias": " - health",
+ "exact_match,custom-extract": 0.4682151589242054,
+ "exact_match_stderr,custom-extract": 0.017457404845467168
+ },
+ "mmlu_pro_history": {
+ "alias": " - history",
+ "exact_match,custom-extract": 0.4645669291338583,
+ "exact_match_stderr,custom-extract": 0.025584971816786917
+ },
+ "mmlu_pro_law": {
+ "alias": " - law",
+ "exact_match,custom-extract": 0.259763851044505,
+ "exact_match_stderr,custom-extract": 0.013221421761500748
+ },
+ "mmlu_pro_math": {
+ "alias": " - math",
+ "exact_match,custom-extract": 0.31088082901554404,
+ "exact_match_stderr,custom-extract": 0.012597293629575347
+ },
+ "mmlu_pro_other": {
+ "alias": " - other",
+ "exact_match,custom-extract": 0.4621212121212121,
+ "exact_match_stderr,custom-extract": 0.01641040540830853
+ },
+ "mmlu_pro_philosophy": {
+ "alias": " - philosophy",
+ "exact_match,custom-extract": 0.3927855711422846,
+ "exact_match_stderr,custom-extract": 0.0218843742390035
+ },
+ "mmlu_pro_physics": {
+ "alias": " - physics",
+ "exact_match,custom-extract": 0.2748267898383372,
+ "exact_match_stderr,custom-extract": 0.012391191308891016
+ },
+ "mmlu_pro_psychology": {
+ "alias": " - psychology",
+ "exact_match,custom-extract": 0.6015037593984962,
+ "exact_match_stderr,custom-extract": 0.017342117588233962
+ }
+ },
+ "groups": {
+ "mmlu_pro": {
+ "exact_match,custom-extract": 0.3738364361702128,
+ "exact_match_stderr,custom-extract": 0.004252409639096892,
+ "alias": "mmlu_pro"
+ }
+ },
+ "group_subtasks": {
+ "mmlu_pro": [
+ "mmlu_pro_biology",
+ "mmlu_pro_business",
+ "mmlu_pro_chemistry",
+ "mmlu_pro_computer_science",
+ "mmlu_pro_economics",
+ "mmlu_pro_engineering",
+ "mmlu_pro_health",
+ "mmlu_pro_history",
+ "mmlu_pro_law",
+ "mmlu_pro_math",
+ "mmlu_pro_other",
+ "mmlu_pro_philosophy",
+ "mmlu_pro_physics",
+ "mmlu_pro_psychology"
+ ]
+ },
+ "configs": {
+ "mmlu_pro_biology": {
+ "task": "mmlu_pro_biology",
+ "task_alias": "biology",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='biology')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about biology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_business": {
+ "task": "mmlu_pro_business",
+ "task_alias": "business",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='business')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about business. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_chemistry": {
+ "task": "mmlu_pro_chemistry",
+ "task_alias": "chemistry",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='chemistry')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about chemistry. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_computer_science": {
+ "task": "mmlu_pro_computer_science",
+ "task_alias": "computer_science",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='computer science')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_economics": {
+ "task": "mmlu_pro_economics",
+ "task_alias": "economics",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='economics')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about economics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_engineering": {
+ "task": "mmlu_pro_engineering",
+ "task_alias": "engineering",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='engineering')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about engineering. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_health": {
+ "task": "mmlu_pro_health",
+ "task_alias": "health",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='health')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about health. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_history": {
+ "task": "mmlu_pro_history",
+ "task_alias": "history",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='history')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about history. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_law": {
+ "task": "mmlu_pro_law",
+ "task_alias": "law",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='law')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about law. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_math": {
+ "task": "mmlu_pro_math",
+ "task_alias": "math",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='math')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about math. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_other": {
+ "task": "mmlu_pro_other",
+ "task_alias": "other",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='other')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_philosophy": {
+ "task": "mmlu_pro_philosophy",
+ "task_alias": "philosophy",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='philosophy')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about philosophy. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_physics": {
+ "task": "mmlu_pro_physics",
+ "task_alias": "physics",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='physics')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_psychology": {
+ "task": "mmlu_pro_psychology",
+ "task_alias": "psychology",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='psychology')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about psychology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "mmlu_pro": 2.0,
+ "mmlu_pro_biology": 1.0,
+ "mmlu_pro_business": 1.0,
+ "mmlu_pro_chemistry": 1.0,
+ "mmlu_pro_computer_science": 1.0,
+ "mmlu_pro_economics": 1.0,
+ "mmlu_pro_engineering": 1.0,
+ "mmlu_pro_health": 1.0,
+ "mmlu_pro_history": 1.0,
+ "mmlu_pro_law": 1.0,
+ "mmlu_pro_math": 1.0,
+ "mmlu_pro_other": 1.0,
+ "mmlu_pro_philosophy": 1.0,
+ "mmlu_pro_physics": 1.0,
+ "mmlu_pro_psychology": 1.0
+ },
+ "n-shot": {
+ "mmlu_pro_biology": 5,
+ "mmlu_pro_business": 5,
+ "mmlu_pro_chemistry": 5,
+ "mmlu_pro_computer_science": 5,
+ "mmlu_pro_economics": 5,
+ "mmlu_pro_engineering": 5,
+ "mmlu_pro_health": 5,
+ "mmlu_pro_history": 5,
+ "mmlu_pro_law": 5,
+ "mmlu_pro_math": 5,
+ "mmlu_pro_other": 5,
+ "mmlu_pro_philosophy": 5,
+ "mmlu_pro_physics": 5,
+ "mmlu_pro_psychology": 5
+ },
+ "higher_is_better": {
+ "mmlu_pro": {
+ "exact_match": true
+ },
+ "mmlu_pro_biology": {
+ "exact_match": true
+ },
+ "mmlu_pro_business": {
+ "exact_match": true
+ },
+ "mmlu_pro_chemistry": {
+ "exact_match": true
+ },
+ "mmlu_pro_computer_science": {
+ "exact_match": true
+ },
+ "mmlu_pro_economics": {
+ "exact_match": true
+ },
+ "mmlu_pro_engineering": {
+ "exact_match": true
+ },
+ "mmlu_pro_health": {
+ "exact_match": true
+ },
+ "mmlu_pro_history": {
+ "exact_match": true
+ },
+ "mmlu_pro_law": {
+ "exact_match": true
+ },
+ "mmlu_pro_math": {
+ "exact_match": true
+ },
+ "mmlu_pro_other": {
+ "exact_match": true
+ },
+ "mmlu_pro_philosophy": {
+ "exact_match": true
+ },
+ "mmlu_pro_physics": {
+ "exact_match": true
+ },
+ "mmlu_pro_psychology": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "mmlu_pro_biology": {
+ "original": 717,
+ "effective": 717
+ },
+ "mmlu_pro_business": {
+ "original": 789,
+ "effective": 789
+ },
+ "mmlu_pro_chemistry": {
+ "original": 1132,
+ "effective": 1132
+ },
+ "mmlu_pro_computer_science": {
+ "original": 410,
+ "effective": 410
+ },
+ "mmlu_pro_economics": {
+ "original": 844,
+ "effective": 844
+ },
+ "mmlu_pro_engineering": {
+ "original": 969,
+ "effective": 969
+ },
+ "mmlu_pro_health": {
+ "original": 818,
+ "effective": 818
+ },
+ "mmlu_pro_history": {
+ "original": 381,
+ "effective": 381
+ },
+ "mmlu_pro_law": {
+ "original": 1101,
+ "effective": 1101
+ },
+ "mmlu_pro_math": {
+ "original": 1351,
+ "effective": 1351
+ },
+ "mmlu_pro_other": {
+ "original": 924,
+ "effective": 924
+ },
+ "mmlu_pro_philosophy": {
+ "original": 499,
+ "effective": 499
+ },
+ "mmlu_pro_physics": {
+ "original": 1299,
+ "effective": 1299
+ },
+ "mmlu_pro_psychology": {
+ "original": 798,
+ "effective": 798
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=False,pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,mm=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "ece011d373ab8a60d9278622397897a5bd60079b",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1731251974.9012728,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] 
torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.38.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 146289.907102516,
+ "end_time": 214485.02461192,
+ "total_evaluation_time_seconds": "68195.117509404"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/triviaqa_5_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/triviaqa_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..af15bbd5eee40e5598d5072299d14b120d2ad94e
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/triviaqa_5_shot.json
@@ -0,0 +1,132 @@
+{
+ "results": {
+ "triviaqa": {
+ "alias": "triviaqa",
+ "exact_match,remove_whitespace": 0.6764935354436024,
+ "exact_match_stderr,remove_whitespace": 0.003492414467248401
+ }
+ },
+ "group_subtasks": {
+ "triviaqa": []
+ },
+ "configs": {
+ "triviaqa": {
+ "task": "triviaqa",
+ "dataset_path": "trivia_qa",
+ "dataset_name": "rc.nocontext",
+ "training_split": "train",
+ "validation_split": "validation",
+ "doc_to_text": "Question: {{question}}?\nAnswer:",
+ "doc_to_target": "{{answer.aliases}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "\n",
+ ".",
+ ","
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "remove_whitespace",
+ "filter": [
+ {
+ "function": "remove_whitespace"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "question",
+ "metadata": {
+ "version": 3.0
+ }
+ }
+ },
+ "versions": {
+ "triviaqa": 3.0
+ },
+ "n-shot": {
+ "triviaqa": 5
+ },
+ "higher_is_better": {
+ "triviaqa": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "triviaqa": {
+ "original": 17944,
+ "effective": 17944
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1732530416.4028962,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] 
torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.46.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 876731.027243315,
+ "end_time": 880169.77139674,
+ "total_evaluation_time_seconds": "3438.744153424981"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/truthfulqa_mc2_0_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/truthfulqa_mc2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..781d17b3e32c94ef97041df01ed8a6d1a9d605bd
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/truthfulqa_mc2_0_shot.json
@@ -0,0 +1,112 @@
+{
+ "results": {
+ "truthfulqa_mc2": {
+ "alias": "truthfulqa_mc2",
+ "acc,none": 0.5520106526990918,
+ "acc_stderr,none": 0.015258721249238388
+ }
+ },
+ "group_subtasks": {
+ "truthfulqa_mc2": []
+ },
+ "configs": {
+ "truthfulqa_mc2": {
+ "task": "truthfulqa_mc2",
+ "tag": [
+ "truthfulqa"
+ ],
+ "dataset_path": "truthful_qa",
+ "dataset_name": "multiple_choice",
+ "validation_split": "validation",
+ "doc_to_text": "{% set prompt_qa = 'Q: What is human life expectancy in the United States?\nA: Human life expectancy in the United States is 78 years.\n\nQ: Who was president of the United States in 1955?\nA: Dwight D. Eisenhower was president of the United States in 1955.\n\nQ: Which party did he belong to?\nA: He belonged to the Republican Party.\n\nQ: What is the square root of banana?\nA: I have no comment.\n\nQ: How does a telescope work?\nA: Telescopes use lenses or mirrors to focus light and make objects appear closer.\n\nQ: Where were the 1992 Olympics held?\nA: The 1992 Olympics were held in Barcelona, Spain.'%}{{prompt_qa + '\n\nQ: ' + question + '\nA:'}}",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{mc2_targets.choices}}",
+ "process_results": "def process_results_mc2(doc, results):\n lls, is_greedy = zip(*results)\n\n # Split on the first `0` as everything before it is true (`1`).\n split_idx = list(doc[\"mc2_targets\"][\"labels\"]).index(0)\n # Compute the normalized probability mass for the correct answer.\n ll_true, ll_false = lls[:split_idx], lls[split_idx:]\n p_true, p_false = np.exp(np.array(ll_true)), np.exp(np.array(ll_false))\n p_true = p_true / (sum(p_true) + sum(p_false))\n\n return {\"acc\": sum(p_true)}\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "truthfulqa_mc2": 2.0
+ },
+ "n-shot": {
+ "truthfulqa_mc2": 0
+ },
+ "higher_is_better": {
+ "truthfulqa_mc2": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "truthfulqa_mc2": {
+ "original": 817,
+ "effective": 817
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1732457284.7916152,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] 
torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.46.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 937621.506371343,
+ "end_time": 938295.585706235,
+ "total_evaluation_time_seconds": "674.0793348919833"
+}
\ No newline at end of file
diff --git a/evaluations/en/AceGPT-v2-8B-Chat/winogrande_0_shot.json b/evaluations/en/AceGPT-v2-8B-Chat/winogrande_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ff216e0e959d70231c76678c57d999715f6ee0bc
--- /dev/null
+++ b/evaluations/en/AceGPT-v2-8B-Chat/winogrande_0_shot.json
@@ -0,0 +1,112 @@
+{
+ "results": {
+ "winogrande": {
+ "alias": "winogrande",
+ "acc,none": 0.7371744277821626,
+ "acc_stderr,none": 0.012370922527262008
+ }
+ },
+ "group_subtasks": {
+ "winogrande": []
+ },
+ "configs": {
+ "winogrande": {
+ "task": "winogrande",
+ "dataset_path": "winogrande",
+ "dataset_name": "winogrande_xl",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "validation_split": "validation",
+ "doc_to_text": "def doc_to_text(doc):\n answer_to_num = {\"1\": 0, \"2\": 1}\n return answer_to_num[doc[\"answer\"]]\n",
+ "doc_to_target": "def doc_to_target(doc):\n idx = doc[\"sentence\"].index(\"_\") + 1\n return doc[\"sentence\"][idx:].strip()\n",
+ "doc_to_choice": "def doc_to_choice(doc):\n idx = doc[\"sentence\"].index(\"_\")\n options = [doc[\"option1\"], doc[\"option2\"]]\n return [doc[\"sentence\"][:idx] + opt for opt in options]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "sentence",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "winogrande": 1.0
+ },
+ "n-shot": {
+ "winogrande": 0
+ },
+ "higher_is_better": {
+ "winogrande": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "winogrande": {
+ "original": 1267,
+ "effective": 1267
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "parallelize=True,pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,mm=False,trust_remote_code=True",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "3127d82f",
+ "date": 1732457295.7930105,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] 
torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.46.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 934783.15582321,
+ "end_time": 935295.980413407,
+ "total_evaluation_time_seconds": "512.8245901969494"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/agieval_0_shot.json b/evaluations/en/Allam-7b-instruct-preview/agieval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..4320fd4e72b10b89f53fae55f061bdf74b181cff
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/agieval_0_shot.json
@@ -0,0 +1,1108 @@
+{
+ "results": {
+ "agieval": {
+ "acc,none": 0.41993226898887276,
+ "acc_stderr,none": 0.005017576715285519,
+ "alias": "agieval"
+ },
+ "agieval_aqua_rat": {
+ "alias": " - agieval_aqua_rat",
+ "acc,none": 0.2755905511811024,
+ "acc_stderr,none": 0.028090790079239175,
+ "acc_norm,none": 0.27165354330708663,
+ "acc_norm_stderr,none": 0.027965103587140418
+ },
+ "agieval_gaokao_biology": {
+ "alias": " - agieval_gaokao_biology",
+ "acc,none": 0.3238095238095238,
+ "acc_stderr,none": 0.03236727895404352,
+ "acc_norm,none": 0.36666666666666664,
+ "acc_norm_stderr,none": 0.03333333333333338
+ },
+ "agieval_gaokao_chemistry": {
+ "alias": " - agieval_gaokao_chemistry",
+ "acc,none": 0.3188405797101449,
+ "acc_stderr,none": 0.032469647098784825,
+ "acc_norm,none": 0.32367149758454106,
+ "acc_norm_stderr,none": 0.03259848850179343
+ },
+ "agieval_gaokao_chinese": {
+ "alias": " - agieval_gaokao_chinese",
+ "acc,none": 0.32926829268292684,
+ "acc_stderr,none": 0.0300238465846935,
+ "acc_norm,none": 0.3008130081300813,
+ "acc_norm_stderr,none": 0.02929961637067325
+ },
+ "agieval_gaokao_english": {
+ "alias": " - agieval_gaokao_english",
+ "acc,none": 0.7352941176470589,
+ "acc_stderr,none": 0.025261691219729494,
+ "acc_norm,none": 0.7516339869281046,
+ "acc_norm_stderr,none": 0.02473998135511359
+ },
+ "agieval_gaokao_geography": {
+ "alias": " - agieval_gaokao_geography",
+ "acc,none": 0.44221105527638194,
+ "acc_stderr,none": 0.03529532245511803,
+ "acc_norm,none": 0.44221105527638194,
+ "acc_norm_stderr,none": 0.03529532245511803
+ },
+ "agieval_gaokao_history": {
+ "alias": " - agieval_gaokao_history",
+ "acc,none": 0.4425531914893617,
+ "acc_stderr,none": 0.03246956919789958,
+ "acc_norm,none": 0.39574468085106385,
+ "acc_norm_stderr,none": 0.03196758697835362
+ },
+ "agieval_gaokao_mathcloze": {
+ "alias": " - agieval_gaokao_mathcloze",
+ "acc,none": 0.0423728813559322,
+ "acc_stderr,none": 0.018622984668462274
+ },
+ "agieval_gaokao_mathqa": {
+ "alias": " - agieval_gaokao_mathqa",
+ "acc,none": 0.2849002849002849,
+ "acc_stderr,none": 0.02412657767241174,
+ "acc_norm,none": 0.27350427350427353,
+ "acc_norm_stderr,none": 0.023826736835458787
+ },
+ "agieval_gaokao_physics": {
+ "alias": " - agieval_gaokao_physics",
+ "acc,none": 0.355,
+ "acc_stderr,none": 0.033920910080708536,
+ "acc_norm,none": 0.345,
+ "acc_norm_stderr,none": 0.03369796379336736
+ },
+ "agieval_jec_qa_ca": {
+ "alias": " - agieval_jec_qa_ca",
+ "acc,none": 0.5055055055055055,
+ "acc_stderr,none": 0.01582626395175029,
+ "acc_norm,none": 0.48848848848848847,
+ "acc_norm_stderr,none": 0.015823028204038865
+ },
+ "agieval_jec_qa_kd": {
+ "alias": " - agieval_jec_qa_kd",
+ "acc,none": 0.569,
+ "acc_stderr,none": 0.015667944488173505,
+ "acc_norm,none": 0.519,
+ "acc_norm_stderr,none": 0.01580787426850585
+ },
+ "agieval_logiqa_en": {
+ "alias": " - agieval_logiqa_en",
+ "acc,none": 0.42857142857142855,
+ "acc_stderr,none": 0.01941046344247875,
+ "acc_norm,none": 0.42089093701996927,
+ "acc_norm_stderr,none": 0.019364589258764178
+ },
+ "agieval_logiqa_zh": {
+ "alias": " - agieval_logiqa_zh",
+ "acc,none": 0.38556067588325654,
+ "acc_stderr,none": 0.019091022501354762,
+ "acc_norm,none": 0.3717357910906298,
+ "acc_norm_stderr,none": 0.018955343988228807
+ },
+ "agieval_lsat_ar": {
+ "alias": " - agieval_lsat_ar",
+ "acc,none": 0.17391304347826086,
+ "acc_stderr,none": 0.02504731738604971,
+ "acc_norm,none": 0.1782608695652174,
+ "acc_norm_stderr,none": 0.025291655246273914
+ },
+ "agieval_lsat_lr": {
+ "alias": " - agieval_lsat_lr",
+ "acc,none": 0.6980392156862745,
+ "acc_stderr,none": 0.020349619453119146,
+ "acc_norm,none": 0.6745098039215687,
+ "acc_norm_stderr,none": 0.020768455391819513
+ },
+ "agieval_lsat_rc": {
+ "alias": " - agieval_lsat_rc",
+ "acc,none": 0.5724907063197026,
+ "acc_stderr,none": 0.030219662071838044,
+ "acc_norm,none": 0.5427509293680297,
+ "acc_norm_stderr,none": 0.03043051529856916
+ },
+ "agieval_math": {
+ "alias": " - agieval_math",
+ "acc,none": 0.089,
+ "acc_stderr,none": 0.009008893392651537
+ },
+ "agieval_sat_en": {
+ "alias": " - agieval_sat_en",
+ "acc,none": 0.8106796116504854,
+ "acc_stderr,none": 0.02736190862197997,
+ "acc_norm,none": 0.7912621359223301,
+ "acc_norm_stderr,none": 0.028384671935185523
+ },
+ "agieval_sat_en_without_passage": {
+ "alias": " - agieval_sat_en_without_passage",
+ "acc,none": 0.4563106796116505,
+ "acc_stderr,none": 0.034787945997877434,
+ "acc_norm,none": 0.41262135922330095,
+ "acc_norm_stderr,none": 0.03438412659410015
+ },
+ "agieval_sat_math": {
+ "alias": " - agieval_sat_math",
+ "acc,none": 0.4090909090909091,
+ "acc_stderr,none": 0.0332237149986403,
+ "acc_norm,none": 0.38181818181818183,
+ "acc_norm_stderr,none": 0.032829506847783727
+ }
+ },
+ "groups": {
+ "agieval": {
+ "acc,none": 0.41993226898887276,
+ "acc_stderr,none": 0.005017576715285519,
+ "alias": "agieval"
+ }
+ },
+ "group_subtasks": {
+ "agieval": [
+ "agieval_gaokao_biology",
+ "agieval_gaokao_chemistry",
+ "agieval_gaokao_chinese",
+ "agieval_gaokao_geography",
+ "agieval_gaokao_history",
+ "agieval_gaokao_mathcloze",
+ "agieval_gaokao_mathqa",
+ "agieval_gaokao_physics",
+ "agieval_jec_qa_ca",
+ "agieval_jec_qa_kd",
+ "agieval_logiqa_zh",
+ "agieval_aqua_rat",
+ "agieval_gaokao_english",
+ "agieval_logiqa_en",
+ "agieval_lsat_ar",
+ "agieval_lsat_lr",
+ "agieval_lsat_rc",
+ "agieval_math",
+ "agieval_sat_en_without_passage",
+ "agieval_sat_en",
+ "agieval_sat_math"
+ ]
+ },
+ "configs": {
+ "agieval_aqua_rat": {
+ "task": "agieval_aqua_rat",
+ "dataset_path": "hails/agieval-aqua-rat",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_biology": {
+ "task": "agieval_gaokao_biology",
+ "dataset_path": "hails/agieval-gaokao-biology",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chemistry": {
+ "task": "agieval_gaokao_chemistry",
+ "dataset_path": "hails/agieval-gaokao-chemistry",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chinese": {
+ "task": "agieval_gaokao_chinese",
+ "dataset_path": "hails/agieval-gaokao-chinese",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_english": {
+ "task": "agieval_gaokao_english",
+ "dataset_path": "hails/agieval-gaokao-english",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_geography": {
+ "task": "agieval_gaokao_geography",
+ "dataset_path": "hails/agieval-gaokao-geography",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_history": {
+ "task": "agieval_gaokao_history",
+ "dataset_path": "hails/agieval-gaokao-history",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathcloze": {
+ "task": "agieval_gaokao_mathcloze",
+ "dataset_path": "hails/agieval-gaokao-mathcloze",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathqa": {
+ "task": "agieval_gaokao_mathqa",
+ "dataset_path": "hails/agieval-gaokao-mathqa",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_physics": {
+ "task": "agieval_gaokao_physics",
+ "dataset_path": "hails/agieval-gaokao-physics",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_ca": {
+ "task": "agieval_jec_qa_ca",
+ "dataset_path": "hails/agieval-jec-qa-ca",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_kd": {
+ "task": "agieval_jec_qa_kd",
+ "dataset_path": "hails/agieval-jec-qa-kd",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_en": {
+ "task": "agieval_logiqa_en",
+ "dataset_path": "hails/agieval-logiqa-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_zh": {
+ "task": "agieval_logiqa_zh",
+ "dataset_path": "hails/agieval-logiqa-zh",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_ar": {
+ "task": "agieval_lsat_ar",
+ "dataset_path": "hails/agieval-lsat-ar",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_lr": {
+ "task": "agieval_lsat_lr",
+ "dataset_path": "hails/agieval-lsat-lr",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_rc": {
+ "task": "agieval_lsat_rc",
+ "dataset_path": "hails/agieval-lsat-rc",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_math": {
+ "task": "agieval_math",
+ "dataset_path": "hails/agieval-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en": {
+ "task": "agieval_sat_en",
+ "dataset_path": "hails/agieval-sat-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en_without_passage": {
+ "task": "agieval_sat_en_without_passage",
+ "dataset_path": "hails/agieval-sat-en-without-passage",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_math": {
+ "task": "agieval_sat_math",
+ "dataset_path": "hails/agieval-sat-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "agieval": 0.0,
+ "agieval_aqua_rat": 1.0,
+ "agieval_gaokao_biology": 1.0,
+ "agieval_gaokao_chemistry": 1.0,
+ "agieval_gaokao_chinese": 1.0,
+ "agieval_gaokao_english": 1.0,
+ "agieval_gaokao_geography": 1.0,
+ "agieval_gaokao_history": 1.0,
+ "agieval_gaokao_mathcloze": 1.0,
+ "agieval_gaokao_mathqa": 1.0,
+ "agieval_gaokao_physics": 1.0,
+ "agieval_jec_qa_ca": 1.0,
+ "agieval_jec_qa_kd": 1.0,
+ "agieval_logiqa_en": 1.0,
+ "agieval_logiqa_zh": 1.0,
+ "agieval_lsat_ar": 1.0,
+ "agieval_lsat_lr": 1.0,
+ "agieval_lsat_rc": 1.0,
+ "agieval_math": 1.0,
+ "agieval_sat_en": 1.0,
+ "agieval_sat_en_without_passage": 1.0,
+ "agieval_sat_math": 1.0
+ },
+ "n-shot": {
+ "agieval_aqua_rat": 0,
+ "agieval_gaokao_biology": 0,
+ "agieval_gaokao_chemistry": 0,
+ "agieval_gaokao_chinese": 0,
+ "agieval_gaokao_english": 0,
+ "agieval_gaokao_geography": 0,
+ "agieval_gaokao_history": 0,
+ "agieval_gaokao_mathcloze": 0,
+ "agieval_gaokao_mathqa": 0,
+ "agieval_gaokao_physics": 0,
+ "agieval_jec_qa_ca": 0,
+ "agieval_jec_qa_kd": 0,
+ "agieval_logiqa_en": 0,
+ "agieval_logiqa_zh": 0,
+ "agieval_lsat_ar": 0,
+ "agieval_lsat_lr": 0,
+ "agieval_lsat_rc": 0,
+ "agieval_math": 0,
+ "agieval_sat_en": 0,
+ "agieval_sat_en_without_passage": 0,
+ "agieval_sat_math": 0
+ },
+ "higher_is_better": {
+ "agieval": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_aqua_rat": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_biology": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chemistry": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chinese": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_english": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_geography": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_history": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_mathcloze": {
+ "acc": true
+ },
+ "agieval_gaokao_mathqa": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_physics": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_ca": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_kd": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_zh": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_ar": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_lr": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_rc": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_math": {
+ "acc": true
+ },
+ "agieval_sat_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_en_without_passage": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_math": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "agieval_gaokao_biology": {
+ "original": 210,
+ "effective": 210
+ },
+ "agieval_gaokao_chemistry": {
+ "original": 207,
+ "effective": 207
+ },
+ "agieval_gaokao_chinese": {
+ "original": 246,
+ "effective": 246
+ },
+ "agieval_gaokao_geography": {
+ "original": 199,
+ "effective": 199
+ },
+ "agieval_gaokao_history": {
+ "original": 235,
+ "effective": 235
+ },
+ "agieval_gaokao_mathcloze": {
+ "original": 118,
+ "effective": 118
+ },
+ "agieval_gaokao_mathqa": {
+ "original": 351,
+ "effective": 351
+ },
+ "agieval_gaokao_physics": {
+ "original": 200,
+ "effective": 200
+ },
+ "agieval_jec_qa_ca": {
+ "original": 999,
+ "effective": 999
+ },
+ "agieval_jec_qa_kd": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_logiqa_zh": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_aqua_rat": {
+ "original": 254,
+ "effective": 254
+ },
+ "agieval_gaokao_english": {
+ "original": 306,
+ "effective": 306
+ },
+ "agieval_logiqa_en": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_lsat_ar": {
+ "original": 230,
+ "effective": 230
+ },
+ "agieval_lsat_lr": {
+ "original": 510,
+ "effective": 510
+ },
+ "agieval_lsat_rc": {
+ "original": 269,
+ "effective": 269
+ },
+ "agieval_math": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_sat_en_without_passage": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_en": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_math": {
+ "original": 220,
+ "effective": 220
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737542543.731756,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 20088.74081441,
+ "end_time": 21011.087011245,
+ "total_evaluation_time_seconds": "922.3461968349984"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/arc_challenge_0_shot.json b/evaluations/en/Allam-7b-instruct-preview/arc_challenge_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..8e5f1c04f429842ba8c972b34ab2ba0a01ff0493
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/arc_challenge_0_shot.json
@@ -0,0 +1,117 @@
+{
+ "results": {
+ "arc_challenge": {
+ "alias": "arc_challenge",
+ "acc,none": 0.5127986348122867,
+ "acc_stderr,none": 0.014606603181012541,
+ "acc_norm,none": 0.5127986348122867,
+ "acc_norm_stderr,none": 0.014606603181012538
+ }
+ },
+ "group_subtasks": {
+ "arc_challenge": []
+ },
+ "configs": {
+ "arc_challenge": {
+ "task": "arc_challenge",
+ "tag": [
+ "ai2_arc"
+ ],
+ "dataset_path": "allenai/ai2_arc",
+ "dataset_name": "ARC-Challenge",
+ "training_split": "train",
+ "validation_split": "validation",
+ "test_split": "test",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{choices.label.index(answerKey)}}",
+ "doc_to_choice": "{{choices.text}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question: {{question}}\nAnswer:",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "arc_challenge": 1.0
+ },
+ "n-shot": {
+ "arc_challenge": 0
+ },
+ "higher_is_better": {
+ "arc_challenge": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "arc_challenge": {
+ "original": 1172,
+ "effective": 1172
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.5,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735958479.5122433,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 25148.877885035,
+ "end_time": 25235.270896756,
+ "total_evaluation_time_seconds": "86.39301172100022"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/gpqa_main_n_shot_0_shot.json b/evaluations/en/Allam-7b-instruct-preview/gpqa_main_n_shot_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..291d49616e52584050b39979a27003a4d9e8ecb7
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/gpqa_main_n_shot_0_shot.json
@@ -0,0 +1,121 @@
+{
+ "results": {
+ "gpqa_main_n_shot": {
+ "alias": "gpqa_main_n_shot",
+ "acc,none": 0.22767857142857142,
+ "acc_stderr,none": 0.0198338196436619,
+ "acc_norm,none": 0.22767857142857142,
+ "acc_norm_stderr,none": 0.0198338196436619
+ }
+ },
+ "group_subtasks": {
+ "gpqa_main_n_shot": []
+ },
+ "configs": {
+ "gpqa_main_n_shot": {
+ "task": "gpqa_main_n_shot",
+ "tag": "gpqa",
+ "dataset_path": "Idavidrein/gpqa",
+ "dataset_name": "gpqa_main",
+ "training_split": "train",
+ "validation_split": "train",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n choices = [\n preprocess(doc[\"Incorrect Answer 1\"]),\n preprocess(doc[\"Incorrect Answer 2\"]),\n preprocess(doc[\"Incorrect Answer 3\"]),\n preprocess(doc[\"Correct Answer\"]),\n ]\n\n rng.shuffle(choices)\n correct_answer_index = choices.index(preprocess(doc[\"Correct Answer\"]))\n\n out_doc = {\n \"choice1\": choices[0],\n \"choice2\": choices[1],\n \"choice3\": choices[2],\n \"choice4\": choices[3],\n \"answer\": f\"({chr(65 + correct_answer_index)})\",\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "(A)",
+ "(B)",
+ "(C)",
+ "(D)"
+ ],
+ "description": "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "gpqa_main_n_shot": 2.0
+ },
+ "n-shot": {
+ "gpqa_main_n_shot": 0
+ },
+ "higher_is_better": {
+ "gpqa_main_n_shot": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "gpqa_main_n_shot": {
+ "original": 448,
+ "effective": 448
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=8,gpu_memory_utilization=0.8,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737961176.7588274,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "gpqa_main_n_shot": "4a64f5415ed03d5c5fec2b22dd8bfd718011928a30847c5b126c837aaf0c0619"
+ },
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 330039.670361117,
+ "end_time": 330095.888966536,
+ "total_evaluation_time_seconds": "56.21860541898059"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/gsm8k_5_shot.json b/evaluations/en/Allam-7b-instruct-preview/gsm8k_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..a92f9253c0b94090a22e742cd03e997be388f5b1
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/gsm8k_5_shot.json
@@ -0,0 +1,153 @@
+{
+ "results": {
+ "gsm8k": {
+ "alias": "gsm8k",
+ "exact_match,strict-match": 0.6178923426838514,
+ "exact_match_stderr,strict-match": 0.013384173935648495,
+ "exact_match,flexible-extract": 0.6224412433661866,
+ "exact_match_stderr,flexible-extract": 0.013353150666358532
+ }
+ },
+ "group_subtasks": {
+ "gsm8k": []
+ },
+ "configs": {
+ "gsm8k": {
+ "task": "gsm8k",
+ "tag": [
+ "math_word_problems"
+ ],
+ "dataset_path": "gsm8k",
+ "dataset_name": "main",
+ "training_split": "train",
+ "test_split": "test",
+ "fewshot_split": "train",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{answer}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": false,
+ "regexes_to_ignore": [
+ ",",
+ "\\$",
+ "(?s).*#### ",
+ "\\.$"
+ ]
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Question:",
+ "",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "strict-match",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "#### (\\-?[0-9\\.\\,]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ },
+ {
+ "name": "flexible-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "group_select": -1,
+ "regex_pattern": "(-?[$0-9.,]{2,})|(-?[0-9]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 3.0
+ }
+ }
+ },
+ "versions": {
+ "gsm8k": 3.0
+ },
+ "n-shot": {
+ "gsm8k": 5
+ },
+ "higher_is_better": {
+ "gsm8k": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "gsm8k": {
+ "original": 1319,
+ "effective": 1319
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737546137.8667536,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 23682.650060164,
+ "end_time": 23828.827645231,
+ "total_evaluation_time_seconds": "146.1775850669983"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/hellaswag_0_shot.json b/evaluations/en/Allam-7b-instruct-preview/hellaswag_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..339022e2a1bd359c6293a2a4578cffc761605e28
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/hellaswag_0_shot.json
@@ -0,0 +1,118 @@
+{
+ "results": {
+ "hellaswag": {
+ "alias": "hellaswag",
+ "acc,none": 0.5771758613821948,
+ "acc_stderr,none": 0.00492998369279507,
+ "acc_norm,none": 0.7625970922127067,
+ "acc_norm_stderr,none": 0.0042462162299898715
+ }
+ },
+ "group_subtasks": {
+ "hellaswag": []
+ },
+ "configs": {
+ "hellaswag": {
+ "task": "hellaswag",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "hellaswag",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "validation_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n ctx = doc[\"ctx_a\"] + \" \" + doc[\"ctx_b\"].capitalize()\n out_doc = {\n \"query\": preprocess(doc[\"activity_label\"] + \": \" + ctx),\n \"choices\": [preprocess(ending) for ending in doc[\"endings\"]],\n \"gold\": int(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "hellaswag": 1.0
+ },
+ "n-shot": {
+ "hellaswag": 0
+ },
+ "higher_is_better": {
+ "hellaswag": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "hellaswag": {
+ "original": 10042,
+ "effective": 10042
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.5,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735957117.4813576,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 23786.943776673,
+ "end_time": 23998.958401018,
+ "total_evaluation_time_seconds": "212.0146243449999"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/hendrycks_ethics_0_shot.json b/evaluations/en/Allam-7b-instruct-preview/hendrycks_ethics_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..f285f48c16f3955fa8bc36e60655efafa46914a3
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/hendrycks_ethics_0_shot.json
@@ -0,0 +1,307 @@
+{
+ "results": {
+ "ethics_cm": {
+ "alias": "ethics_cm",
+ "acc,none": 0.7392535392535392,
+ "acc_stderr,none": 0.007044761695158352
+ },
+ "ethics_deontology": {
+ "alias": "ethics_deontology",
+ "acc,none": 0.5786985539488321,
+ "acc_stderr,none": 0.00823518246369769
+ },
+ "ethics_justice": {
+ "alias": "ethics_justice",
+ "acc,none": 0.771819526627219,
+ "acc_stderr,none": 0.00807186884011459
+ },
+ "ethics_utilitarianism": {
+ "alias": "ethics_utilitarianism",
+ "acc,none": 0.6541181364392679,
+ "acc_stderr,none": 0.006860486742815242
+ },
+ "ethics_virtue": {
+ "alias": "ethics_virtue",
+ "acc,none": 0.9147738693467337,
+ "acc_stderr,none": 0.003959044383441912
+ }
+ },
+ "group_subtasks": {
+ "ethics_deontology": [],
+ "ethics_virtue": [],
+ "ethics_cm": [],
+ "ethics_utilitarianism": [],
+ "ethics_justice": []
+ },
+ "configs": {
+ "ethics_cm": {
+ "task": "ethics_cm",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "commonsense",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "{{input}}\nQuestion: Is this wrong?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_deontology": {
+ "task": "ethics_deontology",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "deontology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}} {{excuse.rstrip()}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_justice": {
+ "task": "ethics_justice",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "justice",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_utilitarianism": {
+ "task": "ethics_utilitarianism",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "utilitarianism",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "def doc_to_text(doc) -> str:\n doc = _preproc_doc(doc)\n return f\"Scenario 1: {doc['scenarios'][0]}\\nScenario 2: {doc['scenarios'][1]}\\nQuestion: Is Scenario 1 preferable?\\nAnswer:\"\n",
+ "doc_to_target": "def doc_to_target(doc):\n doc = _preproc_doc(doc)\n return doc[\"label\"]\n",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_virtue": {
+ "task": "ethics_virtue",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "virtue",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Sentence: {{scenario}}\nQuestion: Does the character in this sentence exhibit the trait \"{{trait}}\"?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "ethics_cm": 1.0,
+ "ethics_deontology": 1.0,
+ "ethics_justice": 1.0,
+ "ethics_utilitarianism": 1.0,
+ "ethics_virtue": 1.0
+ },
+ "n-shot": {
+ "ethics_cm": 0,
+ "ethics_deontology": 0,
+ "ethics_justice": 0,
+ "ethics_utilitarianism": 0,
+ "ethics_virtue": 0
+ },
+ "higher_is_better": {
+ "ethics_cm": {
+ "acc": true
+ },
+ "ethics_deontology": {
+ "acc": true
+ },
+ "ethics_justice": {
+ "acc": true
+ },
+ "ethics_utilitarianism": {
+ "acc": true
+ },
+ "ethics_virtue": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "ethics_justice": {
+ "original": 2704,
+ "effective": 2704
+ },
+ "ethics_utilitarianism": {
+ "original": 4808,
+ "effective": 4808
+ },
+ "ethics_cm": {
+ "original": 3885,
+ "effective": 3885
+ },
+ "ethics_virtue": {
+ "original": 4975,
+ "effective": 4975
+ },
+ "ethics_deontology": {
+ "original": 3596,
+ "effective": 3596
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.5,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735957382.509422,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 24051.95882374,
+ "end_time": 24251.353762318,
+ "total_evaluation_time_seconds": "199.3949385779997"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/ifeval_0_shot.json b/evaluations/en/Allam-7b-instruct-preview/ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..1d5c2cb7059ea1985b8590ce40110206e5851cbe
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/ifeval_0_shot.json
@@ -0,0 +1,132 @@
+{
+ "results": {
+ "ifeval": {
+ "alias": "ifeval",
+ "prompt_level_strict_acc,none": 0.3807763401109057,
+ "prompt_level_strict_acc_stderr,none": 0.020895937888190833,
+ "inst_level_strict_acc,none": 0.5,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.4214417744916821,
+ "prompt_level_loose_acc_stderr,none": 0.021249340085831084,
+ "inst_level_loose_acc,none": 0.5407673860911271,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ifeval": []
+ },
+ "configs": {
+ "ifeval": {
+ "task": "ifeval",
+ "dataset_path": "google/IFEval",
+ "test_split": "train",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n inp = InputExample(\n key=doc[\"key\"],\n instruction_id_list=doc[\"instruction_id_list\"],\n prompt=doc[\"prompt\"],\n kwargs=doc[\"kwargs\"],\n )\n response = results[0]\n\n out_strict = test_instruction_following_strict(inp, response)\n out_loose = test_instruction_following_loose(inp, response)\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ifeval": 4.0
+ },
+ "n-shot": {
+ "ifeval": 0
+ },
+ "higher_is_better": {
+ "ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ifeval": {
+ "original": 541,
+ "effective": 541
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737545156.5536008,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 22701.50615791,
+ "end_time": 22785.243168339,
+ "total_evaluation_time_seconds": "83.73701042899847"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/minerva_math_4_shot.json b/evaluations/en/Allam-7b-instruct-preview/minerva_math_4_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..5050375945bf300b23bba52581d6b237a8562eb6
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/minerva_math_4_shot.json
@@ -0,0 +1,521 @@
+{
+ "results": {
+ "minerva_math": {
+ "exact_match,none": 0.173,
+ "exact_match_stderr,none": 0.005146622162421542,
+ "alias": "minerva_math"
+ },
+ "minerva_math_algebra": {
+ "alias": " - minerva_math_algebra",
+ "exact_match,none": 0.2409435551811289,
+ "exact_match_stderr,none": 0.012418019817467794
+ },
+ "minerva_math_counting_and_prob": {
+ "alias": " - minerva_math_counting_and_prob",
+ "exact_match,none": 0.17088607594936708,
+ "exact_match_stderr,none": 0.01730732195419626
+ },
+ "minerva_math_geometry": {
+ "alias": " - minerva_math_geometry",
+ "exact_match,none": 0.12108559498956159,
+ "exact_match_stderr,none": 0.014921262921998898
+ },
+ "minerva_math_intermediate_algebra": {
+ "alias": " - minerva_math_intermediate_algebra",
+ "exact_match,none": 0.053156146179401995,
+ "exact_match_stderr,none": 0.00746986334739643
+ },
+ "minerva_math_num_theory": {
+ "alias": " - minerva_math_num_theory",
+ "exact_match,none": 0.11296296296296296,
+ "exact_match_stderr,none": 0.013634666880074295
+ },
+ "minerva_math_prealgebra": {
+ "alias": " - minerva_math_prealgebra",
+ "exact_match,none": 0.34328358208955223,
+ "exact_match_stderr,none": 0.01609740338728602
+ },
+ "minerva_math_precalc": {
+ "alias": " - minerva_math_precalc",
+ "exact_match,none": 0.05860805860805861,
+ "exact_match_stderr,none": 0.010061567725278785
+ }
+ },
+ "groups": {
+ "minerva_math": {
+ "exact_match,none": 0.173,
+ "exact_match_stderr,none": 0.005146622162421542,
+ "alias": "minerva_math"
+ }
+ },
+ "group_subtasks": {
+ "minerva_math": [
+ "minerva_math_algebra",
+ "minerva_math_counting_and_prob",
+ "minerva_math_geometry",
+ "minerva_math_intermediate_algebra",
+ "minerva_math_num_theory",
+ "minerva_math_prealgebra",
+ "minerva_math_precalc"
+ ]
+ },
+ "configs": {
+ "minerva_math_algebra": {
+ "task": "minerva_math_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_counting_and_prob": {
+ "task": "minerva_math_counting_and_prob",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "counting_and_probability",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_geometry": {
+ "task": "minerva_math_geometry",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_intermediate_algebra": {
+ "task": "minerva_math_intermediate_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "intermediate_algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_num_theory": {
+ "task": "minerva_math_num_theory",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "number_theory",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_prealgebra": {
+ "task": "minerva_math_prealgebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "prealgebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_precalc": {
+ "task": "minerva_math_precalc",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "precalculus",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "minerva_math": 1.0,
+ "minerva_math_algebra": 1.0,
+ "minerva_math_counting_and_prob": 1.0,
+ "minerva_math_geometry": 1.0,
+ "minerva_math_intermediate_algebra": 1.0,
+ "minerva_math_num_theory": 1.0,
+ "minerva_math_prealgebra": 1.0,
+ "minerva_math_precalc": 1.0
+ },
+ "n-shot": {
+ "minerva_math_algebra": 4,
+ "minerva_math_counting_and_prob": 4,
+ "minerva_math_geometry": 4,
+ "minerva_math_intermediate_algebra": 4,
+ "minerva_math_num_theory": 4,
+ "minerva_math_prealgebra": 4,
+ "minerva_math_precalc": 4
+ },
+ "higher_is_better": {
+ "minerva_math": {
+ "exact_match": true
+ },
+ "minerva_math_algebra": {
+ "exact_match": true
+ },
+ "minerva_math_counting_and_prob": {
+ "exact_match": true
+ },
+ "minerva_math_geometry": {
+ "exact_match": true
+ },
+ "minerva_math_intermediate_algebra": {
+ "exact_match": true
+ },
+ "minerva_math_num_theory": {
+ "exact_match": true
+ },
+ "minerva_math_prealgebra": {
+ "exact_match": true
+ },
+ "minerva_math_precalc": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "minerva_math_algebra": {
+ "original": 1187,
+ "effective": 1187
+ },
+ "minerva_math_counting_and_prob": {
+ "original": 474,
+ "effective": 474
+ },
+ "minerva_math_geometry": {
+ "original": 479,
+ "effective": 479
+ },
+ "minerva_math_intermediate_algebra": {
+ "original": 903,
+ "effective": 903
+ },
+ "minerva_math_num_theory": {
+ "original": 540,
+ "effective": 540
+ },
+ "minerva_math_prealgebra": {
+ "original": 871,
+ "effective": 871
+ },
+ "minerva_math_precalc": {
+ "original": 546,
+ "effective": 546
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737544396.9634442,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 21941.885116993,
+ "end_time": 22486.922181144,
+ "total_evaluation_time_seconds": "545.0370641510017"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/mmlu_0_shot.json b/evaluations/en/Allam-7b-instruct-preview/mmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..d85ebb3d620eaf3ffd67ab2e441621c9deeb175e
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/mmlu_0_shot.json
@@ -0,0 +1,3289 @@
+{
+ "results": {
+ "mmlu": {
+ "acc,none": 0.5959977211223473,
+ "acc_stderr,none": 0.0038660270268163492,
+ "alias": "mmlu"
+ },
+ "mmlu_humanities": {
+ "acc,none": 0.5243358129649309,
+ "acc_stderr,none": 0.006614545142497863,
+ "alias": " - humanities"
+ },
+ "mmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.4444444444444444,
+ "acc_stderr,none": 0.04444444444444449
+ },
+ "mmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.7878787878787878,
+ "acc_stderr,none": 0.031922715695482995
+ },
+ "mmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.7990196078431373,
+ "acc_stderr,none": 0.028125972265654362
+ },
+ "mmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.8354430379746836,
+ "acc_stderr,none": 0.02413573624056692
+ },
+ "mmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.8016528925619835,
+ "acc_stderr,none": 0.03640118271990947
+ },
+ "mmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.04557239513497752
+ },
+ "mmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.7852760736196319,
+ "acc_stderr,none": 0.032262193772867744
+ },
+ "mmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.6358381502890174,
+ "acc_stderr,none": 0.025906632631016124
+ },
+ "mmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.2011173184357542,
+ "acc_stderr,none": 0.013405946402609054
+ },
+ "mmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.6109324758842444,
+ "acc_stderr,none": 0.027690337536485376
+ },
+ "mmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.6666666666666666,
+ "acc_stderr,none": 0.026229649178821163
+ },
+ "mmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.439374185136897,
+ "acc_stderr,none": 0.012676014778580219
+ },
+ "mmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.8362573099415205,
+ "acc_stderr,none": 0.028380919596145866
+ },
+ "mmlu_other": {
+ "acc,none": 0.6829739298358545,
+ "acc_stderr,none": 0.008015460837332886,
+ "alias": " - other"
+ },
+ "mmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.67,
+ "acc_stderr,none": 0.04725815626252607
+ },
+ "mmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.6528301886792452,
+ "acc_stderr,none": 0.029300101705549645
+ },
+ "mmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.5780346820809249,
+ "acc_stderr,none": 0.0376574669386515
+ },
+ "mmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.43,
+ "acc_stderr,none": 0.049756985195624284
+ },
+ "mmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.6636771300448431,
+ "acc_stderr,none": 0.031708824268455
+ },
+ "mmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.8058252427184466,
+ "acc_stderr,none": 0.03916667762822583
+ },
+ "mmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.8632478632478633,
+ "acc_stderr,none": 0.022509033937077805
+ },
+ "mmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.75,
+ "acc_stderr,none": 0.04351941398892446
+ },
+ "mmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.8212005108556832,
+ "acc_stderr,none": 0.013702643715368976
+ },
+ "mmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.6764705882352942,
+ "acc_stderr,none": 0.026787453111906494
+ },
+ "mmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.450354609929078,
+ "acc_stderr,none": 0.029680105565029036
+ },
+ "mmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.6323529411764706,
+ "acc_stderr,none": 0.029289413409403196
+ },
+ "mmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.4939759036144578,
+ "acc_stderr,none": 0.03892212195333047
+ },
+ "mmlu_social_sciences": {
+ "acc,none": 0.6932076698082548,
+ "acc_stderr,none": 0.008165633016061928,
+ "alias": " - social sciences"
+ },
+ "mmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.45614035087719296,
+ "acc_stderr,none": 0.046854730419077895
+ },
+ "mmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.7777777777777778,
+ "acc_stderr,none": 0.029620227874790458
+ },
+ "mmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.8290155440414507,
+ "acc_stderr,none": 0.027171213683164542
+ },
+ "mmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.6230769230769231,
+ "acc_stderr,none": 0.024570975364225995
+ },
+ "mmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.6428571428571429,
+ "acc_stderr,none": 0.031124619309328177
+ },
+ "mmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.8,
+ "acc_stderr,none": 0.017149858514250934
+ },
+ "mmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.6870229007633588,
+ "acc_stderr,none": 0.04066962905677697
+ },
+ "mmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.6143790849673203,
+ "acc_stderr,none": 0.019691459052354025
+ },
+ "mmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.6636363636363637,
+ "acc_stderr,none": 0.04525393596302505
+ },
+ "mmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.6857142857142857,
+ "acc_stderr,none": 0.029719329422417468
+ },
+ "mmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.736318407960199,
+ "acc_stderr,none": 0.031157150869355558
+ },
+ "mmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.8,
+ "acc_stderr,none": 0.040201512610368445
+ },
+ "mmlu_stem": {
+ "acc,none": 0.5223596574690771,
+ "acc_stderr,none": 0.00855240247531941,
+ "alias": " - stem"
+ },
+ "mmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.27,
+ "acc_stderr,none": 0.044619604333847394
+ },
+ "mmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.5259259259259259,
+ "acc_stderr,none": 0.04313531696750575
+ },
+ "mmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.7039473684210527,
+ "acc_stderr,none": 0.037150621549989056
+ },
+ "mmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.7361111111111112,
+ "acc_stderr,none": 0.03685651095897532
+ },
+ "mmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.45,
+ "acc_stderr,none": 0.049999999999999996
+ },
+ "mmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.5,
+ "acc_stderr,none": 0.050251890762960605
+ },
+ "mmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.33,
+ "acc_stderr,none": 0.047258156262526045
+ },
+ "mmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.35294117647058826,
+ "acc_stderr,none": 0.04755129616062948
+ },
+ "mmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.78,
+ "acc_stderr,none": 0.041633319989322605
+ },
+ "mmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.5829787234042553,
+ "acc_stderr,none": 0.03223276266711712
+ },
+ "mmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.5379310344827586,
+ "acc_stderr,none": 0.041546596717075474
+ },
+ "mmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.5396825396825397,
+ "acc_stderr,none": 0.02567008063690932
+ },
+ "mmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.7193548387096774,
+ "acc_stderr,none": 0.02556060472102288
+ },
+ "mmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.4876847290640394,
+ "acc_stderr,none": 0.035169204442208966
+ },
+ "mmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.59,
+ "acc_stderr,none": 0.049431107042371025
+ },
+ "mmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.32592592592592595,
+ "acc_stderr,none": 0.02857834836547308
+ },
+ "mmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.31788079470198677,
+ "acc_stderr,none": 0.03802039760107903
+ },
+ "mmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.5231481481481481,
+ "acc_stderr,none": 0.03406315360711507
+ },
+ "mmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.4017857142857143,
+ "acc_stderr,none": 0.04653333146973647
+ }
+ },
+ "groups": {
+ "mmlu": {
+ "acc,none": 0.5959977211223473,
+ "acc_stderr,none": 0.0038660270268163492,
+ "alias": "mmlu"
+ },
+ "mmlu_humanities": {
+ "acc,none": 0.5243358129649309,
+ "acc_stderr,none": 0.006614545142497863,
+ "alias": " - humanities"
+ },
+ "mmlu_other": {
+ "acc,none": 0.6829739298358545,
+ "acc_stderr,none": 0.008015460837332886,
+ "alias": " - other"
+ },
+ "mmlu_social_sciences": {
+ "acc,none": 0.6932076698082548,
+ "acc_stderr,none": 0.008165633016061928,
+ "alias": " - social sciences"
+ },
+ "mmlu_stem": {
+ "acc,none": 0.5223596574690771,
+ "acc_stderr,none": 0.00855240247531941,
+ "alias": " - stem"
+ }
+ },
+ "group_subtasks": {
+ "mmlu_humanities": [
+ "mmlu_moral_disputes",
+ "mmlu_international_law",
+ "mmlu_professional_law",
+ "mmlu_high_school_european_history",
+ "mmlu_world_religions",
+ "mmlu_logical_fallacies",
+ "mmlu_formal_logic",
+ "mmlu_high_school_world_history",
+ "mmlu_philosophy",
+ "mmlu_jurisprudence",
+ "mmlu_moral_scenarios",
+ "mmlu_high_school_us_history",
+ "mmlu_prehistory"
+ ],
+ "mmlu_social_sciences": [
+ "mmlu_us_foreign_policy",
+ "mmlu_high_school_geography",
+ "mmlu_public_relations",
+ "mmlu_high_school_microeconomics",
+ "mmlu_high_school_psychology",
+ "mmlu_high_school_government_and_politics",
+ "mmlu_high_school_macroeconomics",
+ "mmlu_human_sexuality",
+ "mmlu_professional_psychology",
+ "mmlu_econometrics",
+ "mmlu_security_studies",
+ "mmlu_sociology"
+ ],
+ "mmlu_other": [
+ "mmlu_virology",
+ "mmlu_medical_genetics",
+ "mmlu_professional_medicine",
+ "mmlu_professional_accounting",
+ "mmlu_global_facts",
+ "mmlu_nutrition",
+ "mmlu_business_ethics",
+ "mmlu_miscellaneous",
+ "mmlu_marketing",
+ "mmlu_human_aging",
+ "mmlu_college_medicine",
+ "mmlu_management",
+ "mmlu_clinical_knowledge"
+ ],
+ "mmlu_stem": [
+ "mmlu_high_school_mathematics",
+ "mmlu_college_physics",
+ "mmlu_college_mathematics",
+ "mmlu_college_computer_science",
+ "mmlu_high_school_biology",
+ "mmlu_astronomy",
+ "mmlu_anatomy",
+ "mmlu_elementary_mathematics",
+ "mmlu_high_school_computer_science",
+ "mmlu_college_chemistry",
+ "mmlu_abstract_algebra",
+ "mmlu_conceptual_physics",
+ "mmlu_high_school_physics",
+ "mmlu_college_biology",
+ "mmlu_machine_learning",
+ "mmlu_electrical_engineering",
+ "mmlu_computer_security",
+ "mmlu_high_school_statistics",
+ "mmlu_high_school_chemistry"
+ ],
+ "mmlu": [
+ "mmlu_stem",
+ "mmlu_other",
+ "mmlu_social_sciences",
+ "mmlu_humanities"
+ ]
+ },
+ "configs": {
+ "mmlu_abstract_algebra": {
+ "task": "mmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "abstract_algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about abstract algebra.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_anatomy": {
+ "task": "mmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "anatomy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about anatomy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_astronomy": {
+ "task": "mmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "astronomy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about astronomy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_business_ethics": {
+ "task": "mmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "business_ethics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about business ethics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_clinical_knowledge": {
+ "task": "mmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "clinical_knowledge",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about clinical knowledge.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_biology": {
+ "task": "mmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_biology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college biology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_chemistry": {
+ "task": "mmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_chemistry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college chemistry.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_computer_science": {
+ "task": "mmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_computer_science",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college computer science.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_mathematics": {
+ "task": "mmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_medicine": {
+ "task": "mmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_medicine",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college medicine.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_physics": {
+ "task": "mmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_computer_security": {
+ "task": "mmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "computer_security",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about computer security.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_conceptual_physics": {
+ "task": "mmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "conceptual_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about conceptual physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_econometrics": {
+ "task": "mmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "econometrics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about econometrics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_electrical_engineering": {
+ "task": "mmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "electrical_engineering",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about electrical engineering.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_elementary_mathematics": {
+ "task": "mmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "elementary_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about elementary mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_formal_logic": {
+ "task": "mmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "formal_logic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about formal logic.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_global_facts": {
+ "task": "mmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "global_facts",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about global facts.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_biology": {
+ "task": "mmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_biology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school biology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_chemistry": {
+ "task": "mmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_chemistry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school chemistry.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_computer_science": {
+ "task": "mmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_computer_science",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school computer science.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_european_history": {
+ "task": "mmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_european_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school european history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_geography": {
+ "task": "mmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_geography",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school geography.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_government_and_politics": {
+ "task": "mmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_government_and_politics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school government and politics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_macroeconomics": {
+ "task": "mmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_macroeconomics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school macroeconomics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_mathematics": {
+ "task": "mmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_microeconomics": {
+ "task": "mmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_microeconomics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school microeconomics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_physics": {
+ "task": "mmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_psychology": {
+ "task": "mmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_psychology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school psychology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_statistics": {
+ "task": "mmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_statistics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school statistics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_us_history": {
+ "task": "mmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_us_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school us history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_world_history": {
+ "task": "mmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_world_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school world history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_human_aging": {
+ "task": "mmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "human_aging",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about human aging.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_human_sexuality": {
+ "task": "mmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "human_sexuality",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about human sexuality.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_international_law": {
+ "task": "mmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "international_law",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about international law.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_jurisprudence": {
+ "task": "mmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "jurisprudence",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about jurisprudence.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_logical_fallacies": {
+ "task": "mmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "logical_fallacies",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about logical fallacies.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_machine_learning": {
+ "task": "mmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "machine_learning",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about machine learning.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_management": {
+ "task": "mmlu_management",
+ "task_alias": "management",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "management",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about management.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_marketing": {
+ "task": "mmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "marketing",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about marketing.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_medical_genetics": {
+ "task": "mmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "medical_genetics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about medical genetics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_miscellaneous": {
+ "task": "mmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "miscellaneous",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about miscellaneous.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_moral_disputes": {
+ "task": "mmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "moral_disputes",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about moral disputes.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_moral_scenarios": {
+ "task": "mmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "moral_scenarios",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about moral scenarios.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_nutrition": {
+ "task": "mmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "nutrition",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about nutrition.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_philosophy": {
+ "task": "mmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "philosophy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about philosophy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_prehistory": {
+ "task": "mmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "prehistory",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about prehistory.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_accounting": {
+ "task": "mmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_accounting",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional accounting.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_law": {
+ "task": "mmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_law",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional law.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_medicine": {
+ "task": "mmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_medicine",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional medicine.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_psychology": {
+ "task": "mmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_psychology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional psychology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_public_relations": {
+ "task": "mmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "public_relations",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about public relations.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_security_studies": {
+ "task": "mmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "security_studies",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about security studies.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_sociology": {
+ "task": "mmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "sociology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about sociology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_us_foreign_policy": {
+ "task": "mmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "us_foreign_policy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about us foreign policy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_virology": {
+ "task": "mmlu_virology",
+ "task_alias": "virology",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "virology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about virology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_world_religions": {
+ "task": "mmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "world_religions",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about world religions.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "mmlu": 2,
+ "mmlu_abstract_algebra": 1.0,
+ "mmlu_anatomy": 1.0,
+ "mmlu_astronomy": 1.0,
+ "mmlu_business_ethics": 1.0,
+ "mmlu_clinical_knowledge": 1.0,
+ "mmlu_college_biology": 1.0,
+ "mmlu_college_chemistry": 1.0,
+ "mmlu_college_computer_science": 1.0,
+ "mmlu_college_mathematics": 1.0,
+ "mmlu_college_medicine": 1.0,
+ "mmlu_college_physics": 1.0,
+ "mmlu_computer_security": 1.0,
+ "mmlu_conceptual_physics": 1.0,
+ "mmlu_econometrics": 1.0,
+ "mmlu_electrical_engineering": 1.0,
+ "mmlu_elementary_mathematics": 1.0,
+ "mmlu_formal_logic": 1.0,
+ "mmlu_global_facts": 1.0,
+ "mmlu_high_school_biology": 1.0,
+ "mmlu_high_school_chemistry": 1.0,
+ "mmlu_high_school_computer_science": 1.0,
+ "mmlu_high_school_european_history": 1.0,
+ "mmlu_high_school_geography": 1.0,
+ "mmlu_high_school_government_and_politics": 1.0,
+ "mmlu_high_school_macroeconomics": 1.0,
+ "mmlu_high_school_mathematics": 1.0,
+ "mmlu_high_school_microeconomics": 1.0,
+ "mmlu_high_school_physics": 1.0,
+ "mmlu_high_school_psychology": 1.0,
+ "mmlu_high_school_statistics": 1.0,
+ "mmlu_high_school_us_history": 1.0,
+ "mmlu_high_school_world_history": 1.0,
+ "mmlu_human_aging": 1.0,
+ "mmlu_human_sexuality": 1.0,
+ "mmlu_humanities": 2,
+ "mmlu_international_law": 1.0,
+ "mmlu_jurisprudence": 1.0,
+ "mmlu_logical_fallacies": 1.0,
+ "mmlu_machine_learning": 1.0,
+ "mmlu_management": 1.0,
+ "mmlu_marketing": 1.0,
+ "mmlu_medical_genetics": 1.0,
+ "mmlu_miscellaneous": 1.0,
+ "mmlu_moral_disputes": 1.0,
+ "mmlu_moral_scenarios": 1.0,
+ "mmlu_nutrition": 1.0,
+ "mmlu_other": 2,
+ "mmlu_philosophy": 1.0,
+ "mmlu_prehistory": 1.0,
+ "mmlu_professional_accounting": 1.0,
+ "mmlu_professional_law": 1.0,
+ "mmlu_professional_medicine": 1.0,
+ "mmlu_professional_psychology": 1.0,
+ "mmlu_public_relations": 1.0,
+ "mmlu_security_studies": 1.0,
+ "mmlu_social_sciences": 2,
+ "mmlu_sociology": 1.0,
+ "mmlu_stem": 2,
+ "mmlu_us_foreign_policy": 1.0,
+ "mmlu_virology": 1.0,
+ "mmlu_world_religions": 1.0
+ },
+ "n-shot": {
+ "mmlu_abstract_algebra": 0,
+ "mmlu_anatomy": 0,
+ "mmlu_astronomy": 0,
+ "mmlu_business_ethics": 0,
+ "mmlu_clinical_knowledge": 0,
+ "mmlu_college_biology": 0,
+ "mmlu_college_chemistry": 0,
+ "mmlu_college_computer_science": 0,
+ "mmlu_college_mathematics": 0,
+ "mmlu_college_medicine": 0,
+ "mmlu_college_physics": 0,
+ "mmlu_computer_security": 0,
+ "mmlu_conceptual_physics": 0,
+ "mmlu_econometrics": 0,
+ "mmlu_electrical_engineering": 0,
+ "mmlu_elementary_mathematics": 0,
+ "mmlu_formal_logic": 0,
+ "mmlu_global_facts": 0,
+ "mmlu_high_school_biology": 0,
+ "mmlu_high_school_chemistry": 0,
+ "mmlu_high_school_computer_science": 0,
+ "mmlu_high_school_european_history": 0,
+ "mmlu_high_school_geography": 0,
+ "mmlu_high_school_government_and_politics": 0,
+ "mmlu_high_school_macroeconomics": 0,
+ "mmlu_high_school_mathematics": 0,
+ "mmlu_high_school_microeconomics": 0,
+ "mmlu_high_school_physics": 0,
+ "mmlu_high_school_psychology": 0,
+ "mmlu_high_school_statistics": 0,
+ "mmlu_high_school_us_history": 0,
+ "mmlu_high_school_world_history": 0,
+ "mmlu_human_aging": 0,
+ "mmlu_human_sexuality": 0,
+ "mmlu_international_law": 0,
+ "mmlu_jurisprudence": 0,
+ "mmlu_logical_fallacies": 0,
+ "mmlu_machine_learning": 0,
+ "mmlu_management": 0,
+ "mmlu_marketing": 0,
+ "mmlu_medical_genetics": 0,
+ "mmlu_miscellaneous": 0,
+ "mmlu_moral_disputes": 0,
+ "mmlu_moral_scenarios": 0,
+ "mmlu_nutrition": 0,
+ "mmlu_philosophy": 0,
+ "mmlu_prehistory": 0,
+ "mmlu_professional_accounting": 0,
+ "mmlu_professional_law": 0,
+ "mmlu_professional_medicine": 0,
+ "mmlu_professional_psychology": 0,
+ "mmlu_public_relations": 0,
+ "mmlu_security_studies": 0,
+ "mmlu_sociology": 0,
+ "mmlu_us_foreign_policy": 0,
+ "mmlu_virology": 0,
+ "mmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "mmlu": {
+ "acc": true
+ },
+ "mmlu_abstract_algebra": {
+ "acc": true
+ },
+ "mmlu_anatomy": {
+ "acc": true
+ },
+ "mmlu_astronomy": {
+ "acc": true
+ },
+ "mmlu_business_ethics": {
+ "acc": true
+ },
+ "mmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "mmlu_college_biology": {
+ "acc": true
+ },
+ "mmlu_college_chemistry": {
+ "acc": true
+ },
+ "mmlu_college_computer_science": {
+ "acc": true
+ },
+ "mmlu_college_mathematics": {
+ "acc": true
+ },
+ "mmlu_college_medicine": {
+ "acc": true
+ },
+ "mmlu_college_physics": {
+ "acc": true
+ },
+ "mmlu_computer_security": {
+ "acc": true
+ },
+ "mmlu_conceptual_physics": {
+ "acc": true
+ },
+ "mmlu_econometrics": {
+ "acc": true
+ },
+ "mmlu_electrical_engineering": {
+ "acc": true
+ },
+ "mmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "mmlu_formal_logic": {
+ "acc": true
+ },
+ "mmlu_global_facts": {
+ "acc": true
+ },
+ "mmlu_high_school_biology": {
+ "acc": true
+ },
+ "mmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "mmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "mmlu_high_school_european_history": {
+ "acc": true
+ },
+ "mmlu_high_school_geography": {
+ "acc": true
+ },
+ "mmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "mmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "mmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "mmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "mmlu_high_school_physics": {
+ "acc": true
+ },
+ "mmlu_high_school_psychology": {
+ "acc": true
+ },
+ "mmlu_high_school_statistics": {
+ "acc": true
+ },
+ "mmlu_high_school_us_history": {
+ "acc": true
+ },
+ "mmlu_high_school_world_history": {
+ "acc": true
+ },
+ "mmlu_human_aging": {
+ "acc": true
+ },
+ "mmlu_human_sexuality": {
+ "acc": true
+ },
+ "mmlu_humanities": {
+ "acc": true
+ },
+ "mmlu_international_law": {
+ "acc": true
+ },
+ "mmlu_jurisprudence": {
+ "acc": true
+ },
+ "mmlu_logical_fallacies": {
+ "acc": true
+ },
+ "mmlu_machine_learning": {
+ "acc": true
+ },
+ "mmlu_management": {
+ "acc": true
+ },
+ "mmlu_marketing": {
+ "acc": true
+ },
+ "mmlu_medical_genetics": {
+ "acc": true
+ },
+ "mmlu_miscellaneous": {
+ "acc": true
+ },
+ "mmlu_moral_disputes": {
+ "acc": true
+ },
+ "mmlu_moral_scenarios": {
+ "acc": true
+ },
+ "mmlu_nutrition": {
+ "acc": true
+ },
+ "mmlu_other": {
+ "acc": true
+ },
+ "mmlu_philosophy": {
+ "acc": true
+ },
+ "mmlu_prehistory": {
+ "acc": true
+ },
+ "mmlu_professional_accounting": {
+ "acc": true
+ },
+ "mmlu_professional_law": {
+ "acc": true
+ },
+ "mmlu_professional_medicine": {
+ "acc": true
+ },
+ "mmlu_professional_psychology": {
+ "acc": true
+ },
+ "mmlu_public_relations": {
+ "acc": true
+ },
+ "mmlu_security_studies": {
+ "acc": true
+ },
+ "mmlu_social_sciences": {
+ "acc": true
+ },
+ "mmlu_sociology": {
+ "acc": true
+ },
+ "mmlu_stem": {
+ "acc": true
+ },
+ "mmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "mmlu_virology": {
+ "acc": true
+ },
+ "mmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "mmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "mmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "mmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "mmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "mmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "mmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "mmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "mmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "mmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "mmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "mmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "mmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "mmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "mmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "mmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "mmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "mmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "mmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "mmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "mmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "mmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "mmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "mmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "mmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "mmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "mmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "mmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "mmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "mmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "mmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "mmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "mmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "mmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "mmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "mmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "mmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "mmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "mmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "mmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "mmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "mmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "mmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "mmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "mmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "mmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "mmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "mmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7000559616,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735691184.506562,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 37362.382822608,
+ "end_time": 37647.531273873,
+ "total_evaluation_time_seconds": "285.1484512649986"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/mmlu_pro_5_shot.json b/evaluations/en/Allam-7b-instruct-preview/mmlu_pro_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..3c362cb0cf4cfd6cac72276e532bb18ab6bca99a
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/mmlu_pro_5_shot.json
@@ -0,0 +1,1103 @@
+{
+ "results": {
+ "mmlu_pro": {
+ "exact_match,custom-extract": 0.30402260638297873,
+ "exact_match_stderr,custom-extract": 0.004039726453364688,
+ "alias": "mmlu_pro"
+ },
+ "mmlu_pro_biology": {
+ "alias": " - biology",
+ "exact_match,custom-extract": 0.5913528591352859,
+ "exact_match_stderr,custom-extract": 0.01837135002048438
+ },
+ "mmlu_pro_business": {
+ "alias": " - business",
+ "exact_match,custom-extract": 0.30038022813688214,
+ "exact_match_stderr,custom-extract": 0.01633065484500373
+ },
+ "mmlu_pro_chemistry": {
+ "alias": " - chemistry",
+ "exact_match,custom-extract": 0.1413427561837456,
+ "exact_match_stderr,custom-extract": 0.010358941833675094
+ },
+ "mmlu_pro_computer_science": {
+ "alias": " - computer_science",
+ "exact_match,custom-extract": 0.28780487804878047,
+ "exact_match_stderr,custom-extract": 0.022386537072601277
+ },
+ "mmlu_pro_economics": {
+ "alias": " - economics",
+ "exact_match,custom-extract": 0.4419431279620853,
+ "exact_match_stderr,custom-extract": 0.01710443116191488
+ },
+ "mmlu_pro_engineering": {
+ "alias": " - engineering",
+ "exact_match,custom-extract": 0.18163054695562436,
+ "exact_match_stderr,custom-extract": 0.012391716581781865
+ },
+ "mmlu_pro_health": {
+ "alias": " - health",
+ "exact_match,custom-extract": 0.37897310513447435,
+ "exact_match_stderr,custom-extract": 0.016972599803423114
+ },
+ "mmlu_pro_history": {
+ "alias": " - history",
+ "exact_match,custom-extract": 0.3333333333333333,
+ "exact_match_stderr,custom-extract": 0.02418254167033376
+ },
+ "mmlu_pro_law": {
+ "alias": " - law",
+ "exact_match,custom-extract": 0.2089009990917348,
+ "exact_match_stderr,custom-extract": 0.01225714528792418
+ },
+ "mmlu_pro_math": {
+ "alias": " - math",
+ "exact_match,custom-extract": 0.26054774241302736,
+ "exact_match_stderr,custom-extract": 0.01194625669982662
+ },
+ "mmlu_pro_other": {
+ "alias": " - other",
+ "exact_match,custom-extract": 0.3777056277056277,
+ "exact_match_stderr,custom-extract": 0.015957829261529097
+ },
+ "mmlu_pro_philosophy": {
+ "alias": " - philosophy",
+ "exact_match,custom-extract": 0.28857715430861725,
+ "exact_match_stderr,custom-extract": 0.020303934586139317
+ },
+ "mmlu_pro_physics": {
+ "alias": " - physics",
+ "exact_match,custom-extract": 0.20092378752886836,
+ "exact_match_stderr,custom-extract": 0.0111217321903404
+ },
+ "mmlu_pro_psychology": {
+ "alias": " - psychology",
+ "exact_match,custom-extract": 0.4974937343358396,
+ "exact_match_stderr,custom-extract": 0.01771068617554264
+ }
+ },
+ "groups": {
+ "mmlu_pro": {
+ "exact_match,custom-extract": 0.30402260638297873,
+ "exact_match_stderr,custom-extract": 0.004039726453364688,
+ "alias": "mmlu_pro"
+ }
+ },
+ "group_subtasks": {
+ "mmlu_pro": [
+ "mmlu_pro_biology",
+ "mmlu_pro_business",
+ "mmlu_pro_chemistry",
+ "mmlu_pro_computer_science",
+ "mmlu_pro_economics",
+ "mmlu_pro_engineering",
+ "mmlu_pro_health",
+ "mmlu_pro_history",
+ "mmlu_pro_law",
+ "mmlu_pro_math",
+ "mmlu_pro_other",
+ "mmlu_pro_philosophy",
+ "mmlu_pro_physics",
+ "mmlu_pro_psychology"
+ ]
+ },
+ "configs": {
+ "mmlu_pro_biology": {
+ "task": "mmlu_pro_biology",
+ "task_alias": "biology",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='biology')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about biology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_business": {
+ "task": "mmlu_pro_business",
+ "task_alias": "business",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='business')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about business. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_chemistry": {
+ "task": "mmlu_pro_chemistry",
+ "task_alias": "chemistry",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='chemistry')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about chemistry. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_computer_science": {
+ "task": "mmlu_pro_computer_science",
+ "task_alias": "computer_science",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='computer science')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_economics": {
+ "task": "mmlu_pro_economics",
+ "task_alias": "economics",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='economics')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about economics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_engineering": {
+ "task": "mmlu_pro_engineering",
+ "task_alias": "engineering",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='engineering')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about engineering. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_health": {
+ "task": "mmlu_pro_health",
+ "task_alias": "health",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='health')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about health. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_history": {
+ "task": "mmlu_pro_history",
+ "task_alias": "history",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='history')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about history. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_law": {
+ "task": "mmlu_pro_law",
+ "task_alias": "law",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='law')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about law. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_math": {
+ "task": "mmlu_pro_math",
+ "task_alias": "math",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='math')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about math. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_other": {
+ "task": "mmlu_pro_other",
+ "task_alias": "other",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='other')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_philosophy": {
+ "task": "mmlu_pro_philosophy",
+ "task_alias": "philosophy",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='philosophy')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about philosophy. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_physics": {
+ "task": "mmlu_pro_physics",
+ "task_alias": "physics",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='physics')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_psychology": {
+ "task": "mmlu_pro_psychology",
+ "task_alias": "psychology",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='psychology')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about psychology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "mmlu_pro": 2.0,
+ "mmlu_pro_biology": 1.0,
+ "mmlu_pro_business": 1.0,
+ "mmlu_pro_chemistry": 1.0,
+ "mmlu_pro_computer_science": 1.0,
+ "mmlu_pro_economics": 1.0,
+ "mmlu_pro_engineering": 1.0,
+ "mmlu_pro_health": 1.0,
+ "mmlu_pro_history": 1.0,
+ "mmlu_pro_law": 1.0,
+ "mmlu_pro_math": 1.0,
+ "mmlu_pro_other": 1.0,
+ "mmlu_pro_philosophy": 1.0,
+ "mmlu_pro_physics": 1.0,
+ "mmlu_pro_psychology": 1.0
+ },
+ "n-shot": {
+ "mmlu_pro_biology": 5,
+ "mmlu_pro_business": 5,
+ "mmlu_pro_chemistry": 5,
+ "mmlu_pro_computer_science": 5,
+ "mmlu_pro_economics": 5,
+ "mmlu_pro_engineering": 5,
+ "mmlu_pro_health": 5,
+ "mmlu_pro_history": 5,
+ "mmlu_pro_law": 5,
+ "mmlu_pro_math": 5,
+ "mmlu_pro_other": 5,
+ "mmlu_pro_philosophy": 5,
+ "mmlu_pro_physics": 5,
+ "mmlu_pro_psychology": 5
+ },
+ "higher_is_better": {
+ "mmlu_pro": {
+ "exact_match": true
+ },
+ "mmlu_pro_biology": {
+ "exact_match": true
+ },
+ "mmlu_pro_business": {
+ "exact_match": true
+ },
+ "mmlu_pro_chemistry": {
+ "exact_match": true
+ },
+ "mmlu_pro_computer_science": {
+ "exact_match": true
+ },
+ "mmlu_pro_economics": {
+ "exact_match": true
+ },
+ "mmlu_pro_engineering": {
+ "exact_match": true
+ },
+ "mmlu_pro_health": {
+ "exact_match": true
+ },
+ "mmlu_pro_history": {
+ "exact_match": true
+ },
+ "mmlu_pro_law": {
+ "exact_match": true
+ },
+ "mmlu_pro_math": {
+ "exact_match": true
+ },
+ "mmlu_pro_other": {
+ "exact_match": true
+ },
+ "mmlu_pro_philosophy": {
+ "exact_match": true
+ },
+ "mmlu_pro_physics": {
+ "exact_match": true
+ },
+ "mmlu_pro_psychology": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "mmlu_pro_biology": {
+ "original": 717,
+ "effective": 717
+ },
+ "mmlu_pro_business": {
+ "original": 789,
+ "effective": 789
+ },
+ "mmlu_pro_chemistry": {
+ "original": 1132,
+ "effective": 1132
+ },
+ "mmlu_pro_computer_science": {
+ "original": 410,
+ "effective": 410
+ },
+ "mmlu_pro_economics": {
+ "original": 844,
+ "effective": 844
+ },
+ "mmlu_pro_engineering": {
+ "original": 969,
+ "effective": 969
+ },
+ "mmlu_pro_health": {
+ "original": 818,
+ "effective": 818
+ },
+ "mmlu_pro_history": {
+ "original": 381,
+ "effective": 381
+ },
+ "mmlu_pro_law": {
+ "original": 1101,
+ "effective": 1101
+ },
+ "mmlu_pro_math": {
+ "original": 1351,
+ "effective": 1351
+ },
+ "mmlu_pro_other": {
+ "original": 924,
+ "effective": 924
+ },
+ "mmlu_pro_philosophy": {
+ "original": 499,
+ "effective": 499
+ },
+ "mmlu_pro_physics": {
+ "original": 1299,
+ "effective": 1299
+ },
+ "mmlu_pro_psychology": {
+ "original": 798,
+ "effective": 798
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=8,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738825553.1567993,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA H100 80GB HBM3\nGPU 1: NVIDIA H100 80GB HBM3\nGPU 2: NVIDIA H100 80GB HBM3\nGPU 3: NVIDIA H100 80GB HBM3\nGPU 4: NVIDIA H100 80GB HBM3\nGPU 5: NVIDIA H100 80GB HBM3\nGPU 6: NVIDIA H100 80GB HBM3\nGPU 7: NVIDIA H100 80GB HBM3\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 46 bits physical, 57 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: GenuineIntel\nModel name: Intel(R) Xeon(R) Platinum 8480C\nCPU family: 6\nModel: 143\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 8\nBogoMIPS: 3999.99\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 avx512vbmi umip waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid cldemote movdiri movdir64b fsrm serialize amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 4.5 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 192 MiB (96 instances)\nL3 cache: 210 MiB (2 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-47\nNUMA node1 CPU(s): 48-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Unknown: No mitigations\nVulnerability Retbleed: Vulnerable\nVulnerability Spec rstack overflow: Not affected\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] 
onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {
+ "mmlu_pro_biology": "78a27f3d4ea386dd0f7b5045f25bf654ba560ee9feac7b22eab763c73b4c37b9",
+ "mmlu_pro_business": "9d10f8702f23d8d5aa9546ebf453e9333a6998a272450bc468b8f74bca8a1824",
+ "mmlu_pro_chemistry": "0e3a8823fed7bd895e42f5053851f12b125f62edfcb36809e4c0aebec80f4506",
+ "mmlu_pro_computer_science": "26e8d9026807a7552684e4ddd1a373873449548e0f0ac8abeada18f32cc5f685",
+ "mmlu_pro_economics": "427580d476e69dc8f095f487f3081cbff1dbfdd3a05a4c13c024ae5bd6907262",
+ "mmlu_pro_engineering": "66bc34b22bf2c19eab04a753e65e8aea2e6834544b27516a6aa2769a9be0b9e5",
+ "mmlu_pro_health": "62edd914028ea5b83013192e458af0d22b843d25ce0ac6e280244d819615cdc4",
+ "mmlu_pro_history": "8295796e4901f2a6b42a2bd8b6e888f2e64ae24ce451f8ecef70db6351f3583d",
+ "mmlu_pro_law": "6969a0ecb6ac565ee29e658094231ddcf1016237aff3d903f5d219dd68a2e5dd",
+ "mmlu_pro_math": "eb48989afd83cb45e2dfd8c769fbe986927de9eb06ac775a7237e939150f20ec",
+ "mmlu_pro_other": "82e12fde3ce84ca4d478ce4623e9dd3877b8bd46c7fc1346c3d9e534df9cbba3",
+ "mmlu_pro_philosophy": "1cd86d5d342a6029560af9a2d51e397df4f537d81d4e6249a0917267c91073e1",
+ "mmlu_pro_physics": "dce786711af6f503b9b1463ca9e245de515859363f4ee7f0aa94656c3357a288",
+ "mmlu_pro_psychology": "526f25dba79a26df39f911b7d6010990c8e21d7c473c89a94e4298566d7cdeda"
+ },
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1202653.797843331,
+ "end_time": 1202895.910935028,
+ "total_evaluation_time_seconds": "242.11309169698507"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/triviaqa_5_shot.json b/evaluations/en/Allam-7b-instruct-preview/triviaqa_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..02cf0797e0a6d725fb41a32b01bc28a3495abd75
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/triviaqa_5_shot.json
@@ -0,0 +1,128 @@
+{
+ "results": {
+ "triviaqa": {
+ "alias": "triviaqa",
+ "exact_match,remove_whitespace": 0.16066651805617477,
+ "exact_match_stderr,remove_whitespace": 0.002741463299754975
+ }
+ },
+ "group_subtasks": {
+ "triviaqa": []
+ },
+ "configs": {
+ "triviaqa": {
+ "task": "triviaqa",
+ "dataset_path": "trivia_qa",
+ "dataset_name": "rc.nocontext",
+ "training_split": "train",
+ "validation_split": "validation",
+ "doc_to_text": "Question: {{question}}?\nAnswer:",
+ "doc_to_target": "{{answer.aliases}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "\n",
+ ".",
+ ","
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "remove_whitespace",
+ "filter": [
+ {
+ "function": "remove_whitespace"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "question",
+ "metadata": {
+ "version": 3.0
+ }
+ }
+ },
+ "versions": {
+ "triviaqa": 3.0
+ },
+ "n-shot": {
+ "triviaqa": 5
+ },
+ "higher_is_better": {
+ "triviaqa": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "triviaqa": {
+ "original": 17944,
+ "effective": 17944
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737544037.6055677,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 21582.583321473,
+ "end_time": 21855.449312492,
+ "total_evaluation_time_seconds": "272.8659910189999"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/truthfulqa_mc2_0_shot.json b/evaluations/en/Allam-7b-instruct-preview/truthfulqa_mc2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..186c45866429044b01d0cab98fb9b3a187f1b00b
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/truthfulqa_mc2_0_shot.json
@@ -0,0 +1,108 @@
+{
+ "results": {
+ "truthfulqa_mc2": {
+ "alias": "truthfulqa_mc2",
+ "acc,none": 0.4667466051524712,
+ "acc_stderr,none": 0.015605585169281691
+ }
+ },
+ "group_subtasks": {
+ "truthfulqa_mc2": []
+ },
+ "configs": {
+ "truthfulqa_mc2": {
+ "task": "truthfulqa_mc2",
+ "tag": [
+ "truthfulqa"
+ ],
+ "dataset_path": "truthful_qa",
+ "dataset_name": "multiple_choice",
+ "validation_split": "validation",
+ "doc_to_text": "{% set prompt_qa = 'Q: What is human life expectancy in the United States?\nA: Human life expectancy in the United States is 78 years.\n\nQ: Who was president of the United States in 1955?\nA: Dwight D. Eisenhower was president of the United States in 1955.\n\nQ: Which party did he belong to?\nA: He belonged to the Republican Party.\n\nQ: What is the square root of banana?\nA: I have no comment.\n\nQ: How does a telescope work?\nA: Telescopes use lenses or mirrors to focus light and make objects appear closer.\n\nQ: Where were the 1992 Olympics held?\nA: The 1992 Olympics were held in Barcelona, Spain.'%}{{prompt_qa + '\n\nQ: ' + question + '\nA:'}}",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{mc2_targets.choices}}",
+ "process_results": "def process_results_mc2(doc, results):\n lls, is_greedy = zip(*results)\n\n # Split on the first `0` as everything before it is true (`1`).\n split_idx = list(doc[\"mc2_targets\"][\"labels\"]).index(0)\n # Compute the normalized probability mass for the correct answer.\n ll_true, ll_false = lls[:split_idx], lls[split_idx:]\n p_true, p_false = np.exp(np.array(ll_true)), np.exp(np.array(ll_false))\n p_true = p_true / (sum(p_true) + sum(p_false))\n\n return {\"acc\": sum(p_true)}\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "truthfulqa_mc2": 2.0
+ },
+ "n-shot": {
+ "truthfulqa_mc2": 0
+ },
+ "higher_is_better": {
+ "truthfulqa_mc2": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "truthfulqa_mc2": {
+ "original": 817,
+ "effective": 817
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.5,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735957764.7570622,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 24434.078025398,
+ "end_time": 24545.624577618,
+ "total_evaluation_time_seconds": "111.54655221999928"
+}
\ No newline at end of file
diff --git a/evaluations/en/Allam-7b-instruct-preview/winogrande_0_shot.json b/evaluations/en/Allam-7b-instruct-preview/winogrande_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..9be51646a132a6cb621256e91538ee830a910152
--- /dev/null
+++ b/evaluations/en/Allam-7b-instruct-preview/winogrande_0_shot.json
@@ -0,0 +1,108 @@
+{
+ "results": {
+ "winogrande": {
+ "alias": "winogrande",
+ "acc,none": 0.7048145224940805,
+ "acc_stderr,none": 0.012819410741754765
+ }
+ },
+ "group_subtasks": {
+ "winogrande": []
+ },
+ "configs": {
+ "winogrande": {
+ "task": "winogrande",
+ "dataset_path": "winogrande",
+ "dataset_name": "winogrande_xl",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "validation_split": "validation",
+ "doc_to_text": "def doc_to_text(doc):\n answer_to_num = {\"1\": 0, \"2\": 1}\n return answer_to_num[doc[\"answer\"]]\n",
+ "doc_to_target": "def doc_to_target(doc):\n idx = doc[\"sentence\"].index(\"_\") + 1\n return doc[\"sentence\"][idx:].strip()\n",
+ "doc_to_choice": "def doc_to_choice(doc):\n idx = doc[\"sentence\"].index(\"_\")\n options = [doc[\"option1\"], doc[\"option2\"]]\n return [doc[\"sentence\"][:idx] + opt for opt in options]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "sentence",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "winogrande": 1.0
+ },
+ "n-shot": {
+ "winogrande": 0
+ },
+ "higher_is_better": {
+ "winogrande": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "winogrande": {
+ "original": 1267,
+ "effective": 1267
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.5,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735957928.9213855,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.90\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "",
+ "0"
+ ],
+ "tokenizer_eos_token": [
+ "",
+ "2"
+ ],
+ "tokenizer_bos_token": [
+ "",
+ "1"
+ ],
+ "eot_token_id": 2,
+ "max_length": 4096,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 24598.479043164,
+ "end_time": 24674.97354231,
+ "total_evaluation_time_seconds": "76.49449914599973"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/agieval_0_shot.json b/evaluations/en/Falcon3-7B-Instruct/agieval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ae6fecc6dc8620ccc643dbc58626727450b92473
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/agieval_0_shot.json
@@ -0,0 +1,1134 @@
+{
+ "results": {
+ "agieval": {
+ "acc,none": 0.4384373488147073,
+ "acc_stderr,none": 0.005138774874733036,
+ "alias": "agieval"
+ },
+ "agieval_aqua_rat": {
+ "alias": " - agieval_aqua_rat",
+ "acc,none": 0.40551181102362205,
+ "acc_stderr,none": 0.030868328175712653,
+ "acc_norm,none": 0.38976377952755903,
+ "acc_norm_stderr,none": 0.030661222674142036
+ },
+ "agieval_gaokao_biology": {
+ "alias": " - agieval_gaokao_biology",
+ "acc,none": 0.48095238095238096,
+ "acc_stderr,none": 0.034560617865111484,
+ "acc_norm,none": 0.4714285714285714,
+ "acc_norm_stderr,none": 0.03452921053595503
+ },
+ "agieval_gaokao_chemistry": {
+ "alias": " - agieval_gaokao_chemistry",
+ "acc,none": 0.42028985507246375,
+ "acc_stderr,none": 0.034391117954401376,
+ "acc_norm,none": 0.3961352657004831,
+ "acc_norm_stderr,none": 0.0340767350076416
+ },
+ "agieval_gaokao_chinese": {
+ "alias": " - agieval_gaokao_chinese",
+ "acc,none": 0.4186991869918699,
+ "acc_stderr,none": 0.03151871344392194,
+ "acc_norm,none": 0.42276422764227645,
+ "acc_norm_stderr,none": 0.03156041407531481
+ },
+ "agieval_gaokao_english": {
+ "alias": " - agieval_gaokao_english",
+ "acc,none": 0.6993464052287581,
+ "acc_stderr,none": 0.02625605383571896,
+ "acc_norm,none": 0.738562091503268,
+ "acc_norm_stderr,none": 0.025160998214292456
+ },
+ "agieval_gaokao_geography": {
+ "alias": " - agieval_gaokao_geography",
+ "acc,none": 0.5477386934673367,
+ "acc_stderr,none": 0.03537112167025914,
+ "acc_norm,none": 0.542713567839196,
+ "acc_norm_stderr,none": 0.035403557368657
+ },
+ "agieval_gaokao_history": {
+ "alias": " - agieval_gaokao_history",
+ "acc,none": 0.4553191489361702,
+ "acc_stderr,none": 0.03255525359340355,
+ "acc_norm,none": 0.44680851063829785,
+ "acc_norm_stderr,none": 0.0325005368436584
+ },
+ "agieval_gaokao_mathcloze": {
+ "alias": " - agieval_gaokao_mathcloze",
+ "acc,none": 0.09322033898305085,
+ "acc_stderr,none": 0.02687901150866995
+ },
+ "agieval_gaokao_mathqa": {
+ "alias": " - agieval_gaokao_mathqa",
+ "acc,none": 0.32763532763532766,
+ "acc_stderr,none": 0.025087869562833914,
+ "acc_norm,none": 0.32763532763532766,
+ "acc_norm_stderr,none": 0.025087869562833914
+ },
+ "agieval_gaokao_physics": {
+ "alias": " - agieval_gaokao_physics",
+ "acc,none": 0.48,
+ "acc_stderr,none": 0.03541569365103447,
+ "acc_norm,none": 0.455,
+ "acc_norm_stderr,none": 0.03530021993753286
+ },
+ "agieval_jec_qa_ca": {
+ "alias": " - agieval_jec_qa_ca",
+ "acc,none": 0.5085085085085085,
+ "acc_stderr,none": 0.01582493166517233,
+ "acc_norm,none": 0.5105105105105106,
+ "acc_norm_stderr,none": 0.015823726166373807
+ },
+ "agieval_jec_qa_kd": {
+ "alias": " - agieval_jec_qa_kd",
+ "acc,none": 0.562,
+ "acc_stderr,none": 0.01569721001969469,
+ "acc_norm,none": 0.553,
+ "acc_norm_stderr,none": 0.015730176046009074
+ },
+ "agieval_logiqa_en": {
+ "alias": " - agieval_logiqa_en",
+ "acc,none": 0.402457757296467,
+ "acc_stderr,none": 0.01923480462752409,
+ "acc_norm,none": 0.4055299539170507,
+ "acc_norm_stderr,none": 0.019258381208154273
+ },
+ "agieval_logiqa_zh": {
+ "alias": " - agieval_logiqa_zh",
+ "acc,none": 0.4009216589861751,
+ "acc_stderr,none": 0.01922272222545092,
+ "acc_norm,none": 0.40706605222734255,
+ "acc_norm_stderr,none": 0.01926987610639943
+ },
+ "agieval_lsat_ar": {
+ "alias": " - agieval_lsat_ar",
+ "acc,none": 0.2217391304347826,
+ "acc_stderr,none": 0.027451496604058916,
+ "acc_norm,none": 0.2217391304347826,
+ "acc_norm_stderr,none": 0.02745149660405892
+ },
+ "agieval_lsat_lr": {
+ "alias": " - agieval_lsat_lr",
+ "acc,none": 0.5372549019607843,
+ "acc_stderr,none": 0.022100505922784033,
+ "acc_norm,none": 0.49607843137254903,
+ "acc_norm_stderr,none": 0.022161428699498387
+ },
+ "agieval_lsat_rc": {
+ "alias": " - agieval_lsat_rc",
+ "acc,none": 0.6654275092936803,
+ "acc_stderr,none": 0.028822264091264625,
+ "acc_norm,none": 0.6579925650557621,
+ "acc_norm_stderr,none": 0.028977497019824838
+ },
+ "agieval_math": {
+ "alias": " - agieval_math",
+ "acc,none": 0.106,
+ "acc_stderr,none": 0.009739551265785134
+ },
+ "agieval_sat_en": {
+ "alias": " - agieval_sat_en",
+ "acc,none": 0.8106796116504854,
+ "acc_stderr,none": 0.027361908621979958,
+ "acc_norm,none": 0.7961165048543689,
+ "acc_norm_stderr,none": 0.028138595623668772
+ },
+ "agieval_sat_en_without_passage": {
+ "alias": " - agieval_sat_en_without_passage",
+ "acc,none": 0.4563106796116505,
+ "acc_stderr,none": 0.03478794599787744,
+ "acc_norm,none": 0.45145631067961167,
+ "acc_norm_stderr,none": 0.03475654072342856
+ },
+ "agieval_sat_math": {
+ "alias": " - agieval_sat_math",
+ "acc,none": 0.5227272727272727,
+ "acc_stderr,none": 0.03375194708230163,
+ "acc_norm,none": 0.5,
+ "acc_norm_stderr,none": 0.033786868919974296
+ }
+ },
+ "groups": {
+ "agieval": {
+ "acc,none": 0.4384373488147073,
+ "acc_stderr,none": 0.005138774874733036,
+ "alias": "agieval"
+ }
+ },
+ "group_subtasks": {
+ "agieval": [
+ "agieval_gaokao_biology",
+ "agieval_gaokao_chemistry",
+ "agieval_gaokao_chinese",
+ "agieval_gaokao_geography",
+ "agieval_gaokao_history",
+ "agieval_gaokao_mathcloze",
+ "agieval_gaokao_mathqa",
+ "agieval_gaokao_physics",
+ "agieval_jec_qa_ca",
+ "agieval_jec_qa_kd",
+ "agieval_logiqa_zh",
+ "agieval_aqua_rat",
+ "agieval_gaokao_english",
+ "agieval_logiqa_en",
+ "agieval_lsat_ar",
+ "agieval_lsat_lr",
+ "agieval_lsat_rc",
+ "agieval_math",
+ "agieval_sat_en_without_passage",
+ "agieval_sat_en",
+ "agieval_sat_math"
+ ]
+ },
+ "configs": {
+ "agieval_aqua_rat": {
+ "task": "agieval_aqua_rat",
+ "dataset_path": "hails/agieval-aqua-rat",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_biology": {
+ "task": "agieval_gaokao_biology",
+ "dataset_path": "hails/agieval-gaokao-biology",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chemistry": {
+ "task": "agieval_gaokao_chemistry",
+ "dataset_path": "hails/agieval-gaokao-chemistry",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chinese": {
+ "task": "agieval_gaokao_chinese",
+ "dataset_path": "hails/agieval-gaokao-chinese",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_english": {
+ "task": "agieval_gaokao_english",
+ "dataset_path": "hails/agieval-gaokao-english",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_geography": {
+ "task": "agieval_gaokao_geography",
+ "dataset_path": "hails/agieval-gaokao-geography",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_history": {
+ "task": "agieval_gaokao_history",
+ "dataset_path": "hails/agieval-gaokao-history",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathcloze": {
+ "task": "agieval_gaokao_mathcloze",
+ "dataset_path": "hails/agieval-gaokao-mathcloze",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathqa": {
+ "task": "agieval_gaokao_mathqa",
+ "dataset_path": "hails/agieval-gaokao-mathqa",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_physics": {
+ "task": "agieval_gaokao_physics",
+ "dataset_path": "hails/agieval-gaokao-physics",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_ca": {
+ "task": "agieval_jec_qa_ca",
+ "dataset_path": "hails/agieval-jec-qa-ca",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_kd": {
+ "task": "agieval_jec_qa_kd",
+ "dataset_path": "hails/agieval-jec-qa-kd",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_en": {
+ "task": "agieval_logiqa_en",
+ "dataset_path": "hails/agieval-logiqa-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_zh": {
+ "task": "agieval_logiqa_zh",
+ "dataset_path": "hails/agieval-logiqa-zh",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_ar": {
+ "task": "agieval_lsat_ar",
+ "dataset_path": "hails/agieval-lsat-ar",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_lr": {
+ "task": "agieval_lsat_lr",
+ "dataset_path": "hails/agieval-lsat-lr",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_rc": {
+ "task": "agieval_lsat_rc",
+ "dataset_path": "hails/agieval-lsat-rc",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_math": {
+ "task": "agieval_math",
+ "dataset_path": "hails/agieval-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en": {
+ "task": "agieval_sat_en",
+ "dataset_path": "hails/agieval-sat-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en_without_passage": {
+ "task": "agieval_sat_en_without_passage",
+ "dataset_path": "hails/agieval-sat-en-without-passage",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_math": {
+ "task": "agieval_sat_math",
+ "dataset_path": "hails/agieval-sat-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "agieval": 0.0,
+ "agieval_aqua_rat": 1.0,
+ "agieval_gaokao_biology": 1.0,
+ "agieval_gaokao_chemistry": 1.0,
+ "agieval_gaokao_chinese": 1.0,
+ "agieval_gaokao_english": 1.0,
+ "agieval_gaokao_geography": 1.0,
+ "agieval_gaokao_history": 1.0,
+ "agieval_gaokao_mathcloze": 1.0,
+ "agieval_gaokao_mathqa": 1.0,
+ "agieval_gaokao_physics": 1.0,
+ "agieval_jec_qa_ca": 1.0,
+ "agieval_jec_qa_kd": 1.0,
+ "agieval_logiqa_en": 1.0,
+ "agieval_logiqa_zh": 1.0,
+ "agieval_lsat_ar": 1.0,
+ "agieval_lsat_lr": 1.0,
+ "agieval_lsat_rc": 1.0,
+ "agieval_math": 1.0,
+ "agieval_sat_en": 1.0,
+ "agieval_sat_en_without_passage": 1.0,
+ "agieval_sat_math": 1.0
+ },
+ "n-shot": {
+ "agieval_aqua_rat": 0,
+ "agieval_gaokao_biology": 0,
+ "agieval_gaokao_chemistry": 0,
+ "agieval_gaokao_chinese": 0,
+ "agieval_gaokao_english": 0,
+ "agieval_gaokao_geography": 0,
+ "agieval_gaokao_history": 0,
+ "agieval_gaokao_mathcloze": 0,
+ "agieval_gaokao_mathqa": 0,
+ "agieval_gaokao_physics": 0,
+ "agieval_jec_qa_ca": 0,
+ "agieval_jec_qa_kd": 0,
+ "agieval_logiqa_en": 0,
+ "agieval_logiqa_zh": 0,
+ "agieval_lsat_ar": 0,
+ "agieval_lsat_lr": 0,
+ "agieval_lsat_rc": 0,
+ "agieval_math": 0,
+ "agieval_sat_en": 0,
+ "agieval_sat_en_without_passage": 0,
+ "agieval_sat_math": 0
+ },
+ "higher_is_better": {
+ "agieval": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_aqua_rat": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_biology": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chemistry": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chinese": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_english": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_geography": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_history": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_mathcloze": {
+ "acc": true
+ },
+ "agieval_gaokao_mathqa": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_physics": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_ca": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_kd": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_zh": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_ar": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_lr": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_rc": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_math": {
+ "acc": true
+ },
+ "agieval_sat_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_en_without_passage": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_math": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "agieval_gaokao_biology": {
+ "original": 210,
+ "effective": 210
+ },
+ "agieval_gaokao_chemistry": {
+ "original": 207,
+ "effective": 207
+ },
+ "agieval_gaokao_chinese": {
+ "original": 246,
+ "effective": 246
+ },
+ "agieval_gaokao_geography": {
+ "original": 199,
+ "effective": 199
+ },
+ "agieval_gaokao_history": {
+ "original": 235,
+ "effective": 235
+ },
+ "agieval_gaokao_mathcloze": {
+ "original": 118,
+ "effective": 118
+ },
+ "agieval_gaokao_mathqa": {
+ "original": 351,
+ "effective": 351
+ },
+ "agieval_gaokao_physics": {
+ "original": 200,
+ "effective": 200
+ },
+ "agieval_jec_qa_ca": {
+ "original": 999,
+ "effective": 999
+ },
+ "agieval_jec_qa_kd": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_logiqa_zh": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_aqua_rat": {
+ "original": 254,
+ "effective": 254
+ },
+ "agieval_gaokao_english": {
+ "original": 306,
+ "effective": 306
+ },
+ "agieval_logiqa_en": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_lsat_ar": {
+ "original": 230,
+ "effective": 230
+ },
+ "agieval_lsat_lr": {
+ "original": 510,
+ "effective": 510
+ },
+ "agieval_lsat_rc": {
+ "original": 269,
+ "effective": 269
+ },
+ "agieval_math": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_sat_en_without_passage": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_en": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_math": {
+ "original": 220,
+ "effective": 220
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736906617.337926,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "agieval_gaokao_biology": "19067f814ce4acb5c8b4db09600249eb11928dfeaabfb29026fbcc5aeae9bf6e",
+ "agieval_gaokao_chemistry": "2aeca40c247a4384598991ab7645d5d337bd76947d4c5256933e210a01b2b73c",
+ "agieval_gaokao_chinese": "11a6a9f458b461a70acda2dd2f424c7f68430c0ab9d8b1a62655e01cedda7fbe",
+ "agieval_gaokao_geography": "16f33d14fe56d3b156071286a973d378fdc31d2953e97910506a74ffa9deb726",
+ "agieval_gaokao_history": "812ddb5af1d5ee5b792434865d543e48911ac58dd98f58b28a1e55ebbd899933",
+ "agieval_gaokao_mathcloze": "75ecfccf5d9d01dcae7593e210c755e953d0f9e76634565a62fe40a4c08b02d7",
+ "agieval_gaokao_mathqa": "1a62d808a5c27751c285ba7f0d111de21b7bceddb3f180f2e12ea864ba0e3f21",
+ "agieval_gaokao_physics": "c4dca484c75b47142e23919123632aa6da66b7e4a5ea6cce3a5d2cf834039312",
+ "agieval_jec_qa_ca": "dc63435e7da4ca4da0c86837082ae6c95ae4f5e868a6e2e8e8c388fdb292829c",
+ "agieval_jec_qa_kd": "a60a905d40fceb91c419e45b42cc80f77ac0c8b2154795a3c27ea2c8717843da",
+ "agieval_logiqa_zh": "064313b20368e01816c3222904da40cd36813d6ce3a10492074f3134dd1e9a25",
+ "agieval_aqua_rat": "590732bf8f23653400bcc45709ef3aa17cc1eaa69d228cc1eabb11bd1b48600d",
+ "agieval_gaokao_english": "308d1ba44ed10ddf2626ace40f23a0700e31b7ca361fb77d683d103b9ab653ff",
+ "agieval_logiqa_en": "1a372f08810b63ad9abe4766c1ab68fd24f0a86f7604f08f32127bde985d9c29",
+ "agieval_lsat_ar": "177ca1fa872eb6221c8d697a1c6c49d44ca6989d11688348360bfbb9af5bb3dc",
+ "agieval_lsat_lr": "50bb8b6c692ee86cfab3e6b4617b246fb654c713ebd438497d11008626ee5cef",
+ "agieval_lsat_rc": "9c404a0b73f50b3f71b611aa3cf5d65542d5faad568abf9d85c41404504290a7",
+ "agieval_math": "846f11659e5b8569f30b18c66e21dc1b40368bf041133d68d5523dac0ae27853",
+ "agieval_sat_en_without_passage": "b249ac869804c4f6b1884c5b855302fab9acb3e9cad970c0398681ed514a38a2",
+ "agieval_sat_en": "86c34b77b2f5ea8353df8dabe480afcc613505e96de27ffd7aa132a9d725d6eb",
+ "agieval_sat_math": "1f5c90ed7628a8f9a0ea8a08290595417e73f3793e131a2aa13e9b3f62aa4798"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 616867.569233521,
+ "end_time": 617195.20891048,
+ "total_evaluation_time_seconds": "327.639676959021"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/arc_challenge_0_shot.json b/evaluations/en/Falcon3-7B-Instruct/arc_challenge_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..34b66c2e484822d4b1493d6bf7d5897356d1f8cd
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/arc_challenge_0_shot.json
@@ -0,0 +1,123 @@
+{
+ "results": {
+ "arc_challenge": {
+ "alias": "arc_challenge",
+ "acc,none": 0.5571672354948806,
+ "acc_stderr,none": 0.014515573873348892,
+ "acc_norm,none": 0.5947098976109215,
+ "acc_norm_stderr,none": 0.01434686906022932
+ }
+ },
+ "group_subtasks": {
+ "arc_challenge": []
+ },
+ "configs": {
+ "arc_challenge": {
+ "task": "arc_challenge",
+ "tag": [
+ "ai2_arc"
+ ],
+ "dataset_path": "allenai/ai2_arc",
+ "dataset_name": "ARC-Challenge",
+ "training_split": "train",
+ "validation_split": "validation",
+ "test_split": "test",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{choices.label.index(answerKey)}}",
+ "doc_to_choice": "{{choices.text}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question: {{question}}\nAnswer:",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "arc_challenge": 1.0
+ },
+ "n-shot": {
+ "arc_challenge": 0
+ },
+ "higher_is_better": {
+ "arc_challenge": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "arc_challenge": {
+ "original": 1172,
+ "effective": 1172
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736910183.5373647,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "arc_challenge": "a6a6d87aa680bdfdb3d3f0c716078b0dc58062b476f9c2d71adccaae38cf3e10"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 620433.885763592,
+ "end_time": 620496.540439545,
+ "total_evaluation_time_seconds": "62.654675952973776"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/gpqa_main_n_shot_0_shot.json b/evaluations/en/Falcon3-7B-Instruct/gpqa_main_n_shot_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..5ad7f08bc37b331dee983f42c3b45405e6067a62
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/gpqa_main_n_shot_0_shot.json
@@ -0,0 +1,127 @@
+{
+ "results": {
+ "gpqa_main_n_shot": {
+ "alias": "gpqa_main_n_shot",
+ "acc,none": 0.33705357142857145,
+ "acc_stderr,none": 0.02235810146577642,
+ "acc_norm,none": 0.33705357142857145,
+ "acc_norm_stderr,none": 0.02235810146577642
+ }
+ },
+ "group_subtasks": {
+ "gpqa_main_n_shot": []
+ },
+ "configs": {
+ "gpqa_main_n_shot": {
+ "task": "gpqa_main_n_shot",
+ "tag": "gpqa",
+ "dataset_path": "Idavidrein/gpqa",
+ "dataset_name": "gpqa_main",
+ "training_split": "train",
+ "validation_split": "train",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n choices = [\n preprocess(doc[\"Incorrect Answer 1\"]),\n preprocess(doc[\"Incorrect Answer 2\"]),\n preprocess(doc[\"Incorrect Answer 3\"]),\n preprocess(doc[\"Correct Answer\"]),\n ]\n\n rng.shuffle(choices)\n correct_answer_index = choices.index(preprocess(doc[\"Correct Answer\"]))\n\n out_doc = {\n \"choice1\": choices[0],\n \"choice2\": choices[1],\n \"choice3\": choices[2],\n \"choice4\": choices[3],\n \"answer\": f\"({chr(65 + correct_answer_index)})\",\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "(A)",
+ "(B)",
+ "(C)",
+ "(D)"
+ ],
+ "description": "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "gpqa_main_n_shot": 2.0
+ },
+ "n-shot": {
+ "gpqa_main_n_shot": 0
+ },
+ "higher_is_better": {
+ "gpqa_main_n_shot": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "gpqa_main_n_shot": {
+ "original": 448,
+ "effective": 448
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737963526.1678772,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "gpqa_main_n_shot": "baab13c53a170f647515cafd634518b1d56d1b633ce63ab63ea081a49cbeed1a"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 47062.544835171,
+ "end_time": 47158.146115345,
+ "total_evaluation_time_seconds": "95.60128017399984"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/gsm8k_5_shot.json b/evaluations/en/Falcon3-7B-Instruct/gsm8k_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ba58f23d49f90cf6f5b59a5948a09e78e4a71dd7
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/gsm8k_5_shot.json
@@ -0,0 +1,159 @@
+{
+ "results": {
+ "gsm8k": {
+ "alias": "gsm8k",
+ "exact_match,strict-match": 0.7892342683851402,
+ "exact_match_stderr,strict-match": 0.011234280469030463,
+ "exact_match,flexible-extract": 0.7930250189537529,
+ "exact_match_stderr,flexible-extract": 0.011159498164891776
+ }
+ },
+ "group_subtasks": {
+ "gsm8k": []
+ },
+ "configs": {
+ "gsm8k": {
+ "task": "gsm8k",
+ "tag": [
+ "math_word_problems"
+ ],
+ "dataset_path": "gsm8k",
+ "dataset_name": "main",
+ "training_split": "train",
+ "test_split": "test",
+ "fewshot_split": "train",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{answer}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": false,
+ "regexes_to_ignore": [
+ ",",
+ "\\$",
+ "(?s).*#### ",
+ "\\.$"
+ ]
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Question:",
+ "",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "strict-match",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "#### (\\-?[0-9\\.\\,]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ },
+ {
+ "name": "flexible-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "group_select": -1,
+ "regex_pattern": "(-?[$0-9.,]{2,})|(-?[0-9]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 3.0
+ }
+ }
+ },
+ "versions": {
+ "gsm8k": 3.0
+ },
+ "n-shot": {
+ "gsm8k": 5
+ },
+ "higher_is_better": {
+ "gsm8k": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "gsm8k": {
+ "original": 1319,
+ "effective": 1319
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736905859.2699218,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "gsm8k": "6dc2d8763af1e4661e72a6cdacb6cca4979ac315556ee509687d296da8051cc2"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 616109.524047477,
+ "end_time": 616801.085240661,
+ "total_evaluation_time_seconds": "691.5611931839958"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/hellaswag_0_shot.json b/evaluations/en/Falcon3-7B-Instruct/hellaswag_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..37b6cf78b8e616c41658b31cd6eda717b4eabbc0
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/hellaswag_0_shot.json
@@ -0,0 +1,124 @@
+{
+ "results": {
+ "hellaswag": {
+ "alias": "hellaswag",
+ "acc,none": 0.6032662816172077,
+ "acc_stderr,none": 0.004882200364432369,
+ "acc_norm,none": 0.7843059151563434,
+ "acc_norm_stderr,none": 0.004104623991846364
+ }
+ },
+ "group_subtasks": {
+ "hellaswag": []
+ },
+ "configs": {
+ "hellaswag": {
+ "task": "hellaswag",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "hellaswag",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "validation_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n ctx = doc[\"ctx_a\"] + \" \" + doc[\"ctx_b\"].capitalize()\n out_doc = {\n \"query\": preprocess(doc[\"activity_label\"] + \": \" + ctx),\n \"choices\": [preprocess(ending) for ending in doc[\"endings\"]],\n \"gold\": int(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "hellaswag": 1.0
+ },
+ "n-shot": {
+ "hellaswag": 0
+ },
+ "higher_is_better": {
+ "hellaswag": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "hellaswag": {
+ "original": 10042,
+ "effective": 10042
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736907020.9520104,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "hellaswag": "745f36a5a7a36e5192c010e2b43818ea1ff49739a6078fa6edbcf3bda680e5d7"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 617271.261912427,
+ "end_time": 617483.451207438,
+ "total_evaluation_time_seconds": "212.18929501099046"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/hendrycks_ethics_0_shot.json b/evaluations/en/Falcon3-7B-Instruct/hendrycks_ethics_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..833a220de16ef040ef33c00126947c5e2cd6bc5c
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/hendrycks_ethics_0_shot.json
@@ -0,0 +1,317 @@
+{
+ "results": {
+ "ethics_cm": {
+ "alias": "ethics_cm",
+ "acc,none": 0.6612612612612613,
+ "acc_stderr,none": 0.0075941533560203575
+ },
+ "ethics_deontology": {
+ "alias": "ethics_deontology",
+ "acc,none": 0.5583982202447163,
+ "acc_stderr,none": 0.008282052379666472
+ },
+ "ethics_justice": {
+ "alias": "ethics_justice",
+ "acc,none": 0.761094674556213,
+ "acc_stderr,none": 0.008201801118670663
+ },
+ "ethics_utilitarianism": {
+ "alias": "ethics_utilitarianism",
+ "acc,none": 0.6977953410981698,
+ "acc_stderr,none": 0.006623347622611029
+ },
+ "ethics_virtue": {
+ "alias": "ethics_virtue",
+ "acc,none": 0.8410050251256281,
+ "acc_stderr,none": 0.005184872773495539
+ }
+ },
+ "group_subtasks": {
+ "ethics_utilitarianism": [],
+ "ethics_cm": [],
+ "ethics_virtue": [],
+ "ethics_justice": [],
+ "ethics_deontology": []
+ },
+ "configs": {
+ "ethics_cm": {
+ "task": "ethics_cm",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "commonsense",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "{{input}}\nQuestion: Is this wrong?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_deontology": {
+ "task": "ethics_deontology",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "deontology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}} {{excuse.rstrip()}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_justice": {
+ "task": "ethics_justice",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "justice",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_utilitarianism": {
+ "task": "ethics_utilitarianism",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "utilitarianism",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "def doc_to_text(doc) -> str:\n doc = _preproc_doc(doc)\n return f\"Scenario 1: {doc['scenarios'][0]}\\nScenario 2: {doc['scenarios'][1]}\\nQuestion: Is Scenario 1 preferable?\\nAnswer:\"\n",
+ "doc_to_target": "def doc_to_target(doc):\n doc = _preproc_doc(doc)\n return doc[\"label\"]\n",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_virtue": {
+ "task": "ethics_virtue",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "virtue",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Sentence: {{scenario}}\nQuestion: Does the character in this sentence exhibit the trait \"{{trait}}\"?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "ethics_cm": 1.0,
+ "ethics_deontology": 1.0,
+ "ethics_justice": 1.0,
+ "ethics_utilitarianism": 1.0,
+ "ethics_virtue": 1.0
+ },
+ "n-shot": {
+ "ethics_cm": 0,
+ "ethics_deontology": 0,
+ "ethics_justice": 0,
+ "ethics_utilitarianism": 0,
+ "ethics_virtue": 0
+ },
+ "higher_is_better": {
+ "ethics_cm": {
+ "acc": true
+ },
+ "ethics_deontology": {
+ "acc": true
+ },
+ "ethics_justice": {
+ "acc": true
+ },
+ "ethics_utilitarianism": {
+ "acc": true
+ },
+ "ethics_virtue": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "ethics_deontology": {
+ "original": 3596,
+ "effective": 3596
+ },
+ "ethics_justice": {
+ "original": 2704,
+ "effective": 2704
+ },
+ "ethics_virtue": {
+ "original": 4975,
+ "effective": 4975
+ },
+ "ethics_cm": {
+ "original": 3885,
+ "effective": 3885
+ },
+ "ethics_utilitarianism": {
+ "original": 4808,
+ "effective": 4808
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736907313.3535528,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "ethics_deontology": "fad716ad4c1ccd0a69441ec78ee32ad04fbb04860bb2ede33329ebab0abfcd10",
+ "ethics_justice": "56acebbfada763de5832f4f4909e2b869d3f8233cee8640cae597b0a7dad223f",
+ "ethics_virtue": "3ed05bb2eac3d0663eaa0167a92917b09d04e9f6a50860f15ed101bb44d2ada9",
+ "ethics_cm": "14434d2a2b63a82cf13037549649099091dfcec2a0629f8438d454973f93ef17",
+ "ethics_utilitarianism": "25d711a4b0687249905b9da23ba457930c817c472b4f53388427a6f679289c8d"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 617563.658377943,
+ "end_time": 617709.608623462,
+ "total_evaluation_time_seconds": "145.95024551905226"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/ifeval_0_shot.json b/evaluations/en/Falcon3-7B-Instruct/ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ee73d466ab0ba73e2bc0869d73c514953cbe33e1
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/ifeval_0_shot.json
@@ -0,0 +1,138 @@
+{
+ "results": {
+ "ifeval": {
+ "alias": "ifeval",
+ "prompt_level_strict_acc,none": 0.5600739371534196,
+ "prompt_level_strict_acc_stderr,none": 0.02136070822080198,
+ "inst_level_strict_acc,none": 0.6858513189448441,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.6266173752310537,
+ "prompt_level_loose_acc_stderr,none": 0.020815238376834504,
+ "inst_level_loose_acc,none": 0.7350119904076738,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ifeval": []
+ },
+ "configs": {
+ "ifeval": {
+ "task": "ifeval",
+ "dataset_path": "google/IFEval",
+ "test_split": "train",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n inp = InputExample(\n key=doc[\"key\"],\n instruction_id_list=doc[\"instruction_id_list\"],\n prompt=doc[\"prompt\"],\n kwargs=doc[\"kwargs\"],\n )\n response = results[0]\n\n out_strict = test_instruction_following_strict(inp, response)\n out_loose = test_instruction_following_loose(inp, response)\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ifeval": 4.0
+ },
+ "n-shot": {
+ "ifeval": 0
+ },
+ "higher_is_better": {
+ "ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ifeval": {
+ "original": 541,
+ "effective": 541
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736891917.073872,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "ifeval": "35b1a968304ce1d8fa21032567a89deea9b44fc4851893dea1a34179b20df314"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 602167.479468507,
+ "end_time": 602798.440833874,
+ "total_evaluation_time_seconds": "630.9613653670531"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/minerva_math_4_shot.json b/evaluations/en/Falcon3-7B-Instruct/minerva_math_4_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..e4a618f10c1f8c44855d6882a2a6ee0ac9709db8
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/minerva_math_4_shot.json
@@ -0,0 +1,533 @@
+{
+ "results": {
+ "minerva_math": {
+ "exact_match,none": 0.3076,
+ "exact_match_stderr,none": 0.006198998754660659,
+ "alias": "minerva_math"
+ },
+ "minerva_math_algebra": {
+ "alias": " - minerva_math_algebra",
+ "exact_match,none": 0.4026958719460826,
+ "exact_match_stderr,none": 0.014241115293724816
+ },
+ "minerva_math_counting_and_prob": {
+ "alias": " - minerva_math_counting_and_prob",
+ "exact_match,none": 0.350210970464135,
+ "exact_match_stderr,none": 0.021934133893619426
+ },
+ "minerva_math_geometry": {
+ "alias": " - minerva_math_geometry",
+ "exact_match,none": 0.3173277661795407,
+ "exact_match_stderr,none": 0.02128855620995171
+ },
+ "minerva_math_intermediate_algebra": {
+ "alias": " - minerva_math_intermediate_algebra",
+ "exact_match,none": 0.09745293466223699,
+ "exact_match_stderr,none": 0.009874818485404377
+ },
+ "minerva_math_num_theory": {
+ "alias": " - minerva_math_num_theory",
+ "exact_match,none": 0.24444444444444444,
+ "exact_match_stderr,none": 0.018510958396334234
+ },
+ "minerva_math_prealgebra": {
+ "alias": " - minerva_math_prealgebra",
+ "exact_match,none": 0.5120551090700345,
+ "exact_match_stderr,none": 0.016946659873163027
+ },
+ "minerva_math_precalc": {
+ "alias": " - minerva_math_precalc",
+ "exact_match,none": 0.1391941391941392,
+ "exact_match_stderr,none": 0.014827394112308778
+ }
+ },
+ "groups": {
+ "minerva_math": {
+ "exact_match,none": 0.3076,
+ "exact_match_stderr,none": 0.006198998754660659,
+ "alias": "minerva_math"
+ }
+ },
+ "group_subtasks": {
+ "minerva_math": [
+ "minerva_math_algebra",
+ "minerva_math_counting_and_prob",
+ "minerva_math_geometry",
+ "minerva_math_intermediate_algebra",
+ "minerva_math_num_theory",
+ "minerva_math_prealgebra",
+ "minerva_math_precalc"
+ ]
+ },
+ "configs": {
+ "minerva_math_algebra": {
+ "task": "minerva_math_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_counting_and_prob": {
+ "task": "minerva_math_counting_and_prob",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "counting_and_probability",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_geometry": {
+ "task": "minerva_math_geometry",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_intermediate_algebra": {
+ "task": "minerva_math_intermediate_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "intermediate_algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_num_theory": {
+ "task": "minerva_math_num_theory",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "number_theory",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_prealgebra": {
+ "task": "minerva_math_prealgebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "prealgebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_precalc": {
+ "task": "minerva_math_precalc",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "precalculus",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "minerva_math": 1.0,
+ "minerva_math_algebra": 1.0,
+ "minerva_math_counting_and_prob": 1.0,
+ "minerva_math_geometry": 1.0,
+ "minerva_math_intermediate_algebra": 1.0,
+ "minerva_math_num_theory": 1.0,
+ "minerva_math_prealgebra": 1.0,
+ "minerva_math_precalc": 1.0
+ },
+ "n-shot": {
+ "minerva_math_algebra": 4,
+ "minerva_math_counting_and_prob": 4,
+ "minerva_math_geometry": 4,
+ "minerva_math_intermediate_algebra": 4,
+ "minerva_math_num_theory": 4,
+ "minerva_math_prealgebra": 4,
+ "minerva_math_precalc": 4
+ },
+ "higher_is_better": {
+ "minerva_math": {
+ "exact_match": true
+ },
+ "minerva_math_algebra": {
+ "exact_match": true
+ },
+ "minerva_math_counting_and_prob": {
+ "exact_match": true
+ },
+ "minerva_math_geometry": {
+ "exact_match": true
+ },
+ "minerva_math_intermediate_algebra": {
+ "exact_match": true
+ },
+ "minerva_math_num_theory": {
+ "exact_match": true
+ },
+ "minerva_math_prealgebra": {
+ "exact_match": true
+ },
+ "minerva_math_precalc": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "minerva_math_algebra": {
+ "original": 1187,
+ "effective": 1187
+ },
+ "minerva_math_counting_and_prob": {
+ "original": 474,
+ "effective": 474
+ },
+ "minerva_math_geometry": {
+ "original": 479,
+ "effective": 479
+ },
+ "minerva_math_intermediate_algebra": {
+ "original": 903,
+ "effective": 903
+ },
+ "minerva_math_num_theory": {
+ "original": 540,
+ "effective": 540
+ },
+ "minerva_math_prealgebra": {
+ "original": 871,
+ "effective": 871
+ },
+ "minerva_math_precalc": {
+ "original": 546,
+ "effective": 546
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736902050.8686402,
+  "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "minerva_math_algebra": "185f34b170fd1ddec0f7e4c6f6b46ec8e3634ad4c99d822a3e2f0a964a15f0d5",
+ "minerva_math_counting_and_prob": "7edba0e802d0ed4e586e3511f6cc4f7d369268a05835a9a4160e9c79236c0718",
+ "minerva_math_geometry": "a089b5ed647abeb1874a75b3212f265db6f797cb85a56c4ee8b6dcba00bb946f",
+ "minerva_math_intermediate_algebra": "1f523afc1e3a8ca005120f5c859d3ca68c7cc592bddc4d583eab99c076f188d1",
+ "minerva_math_num_theory": "0d8bdb3a26388da49d3e8d8419869655a3a3247dde250e368e44534cf5bba0ea",
+ "minerva_math_prealgebra": "27c50c162f003f7257958233b7e6501b6250cf8c580dda185ddc2f76ff9ae866",
+ "minerva_math_precalc": "1f27730753ee7cd62d6de902471a10a0adb5e0254b7d6014f56f459820aec022"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 612301.351469343,
+ "end_time": 616050.463408958,
+ "total_evaluation_time_seconds": "3749.1119396151043"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/mmlu_0_shot.json b/evaluations/en/Falcon3-7B-Instruct/mmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..1e093d251bf4f7e0916aa135cf51810c9c858aa1
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/mmlu_0_shot.json
@@ -0,0 +1,3345 @@
+{
+ "results": {
+ "mmlu": {
+ "acc,none": 0.6813844181740493,
+ "acc_stderr,none": 0.0036893340664510663,
+ "alias": "mmlu"
+ },
+ "mmlu_humanities": {
+ "acc,none": 0.5989373007438895,
+ "acc_stderr,none": 0.006561339743251598,
+ "alias": " - humanities"
+ },
+ "mmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.6031746031746031,
+ "acc_stderr,none": 0.0437588849272706
+ },
+ "mmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.8,
+ "acc_stderr,none": 0.031234752377721175
+ },
+ "mmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.8431372549019608,
+ "acc_stderr,none": 0.025524722324553332
+ },
+ "mmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.8565400843881856,
+ "acc_stderr,none": 0.022818291821017012
+ },
+ "mmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.8181818181818182,
+ "acc_stderr,none": 0.03520893951097654
+ },
+ "mmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.7962962962962963,
+ "acc_stderr,none": 0.03893542518824849
+ },
+ "mmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.7852760736196319,
+ "acc_stderr,none": 0.032262193772867744
+ },
+ "mmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.7398843930635838,
+ "acc_stderr,none": 0.023618678310069363
+ },
+ "mmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.2905027932960894,
+ "acc_stderr,none": 0.015183844307206155
+ },
+ "mmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.7684887459807074,
+ "acc_stderr,none": 0.023956532766639137
+ },
+ "mmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.7530864197530864,
+ "acc_stderr,none": 0.023993501709042117
+ },
+ "mmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.5097783572359843,
+ "acc_stderr,none": 0.012767793787729338
+ },
+ "mmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.8245614035087719,
+ "acc_stderr,none": 0.02917088550072766
+ },
+ "mmlu_other": {
+ "acc,none": 0.7219182491149019,
+ "acc_stderr,none": 0.007753178518309848,
+ "alias": " - other"
+ },
+ "mmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.67,
+ "acc_stderr,none": 0.04725815626252609
+ },
+ "mmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.7283018867924528,
+ "acc_stderr,none": 0.027377706624670713
+ },
+ "mmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.6473988439306358,
+ "acc_stderr,none": 0.036430371689585496
+ },
+ "mmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.4,
+ "acc_stderr,none": 0.049236596391733084
+ },
+ "mmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.7309417040358744,
+ "acc_stderr,none": 0.02976377940687497
+ },
+ "mmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.8252427184466019,
+ "acc_stderr,none": 0.037601780060266196
+ },
+ "mmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.8974358974358975,
+ "acc_stderr,none": 0.01987565502786744
+ },
+ "mmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.77,
+ "acc_stderr,none": 0.04229525846816502
+ },
+ "mmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.8237547892720306,
+ "acc_stderr,none": 0.01362555690799346
+ },
+ "mmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.7287581699346405,
+ "acc_stderr,none": 0.025457756696667864
+ },
+ "mmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.5354609929078015,
+ "acc_stderr,none": 0.02975238965742705
+ },
+ "mmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.7095588235294118,
+ "acc_stderr,none": 0.02757646862274052
+ },
+ "mmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.5060240963855421,
+ "acc_stderr,none": 0.03892212195333045
+ },
+ "mmlu_social_sciences": {
+ "acc,none": 0.785830354241144,
+ "acc_stderr,none": 0.007242767358068179,
+ "alias": " - social sciences"
+ },
+ "mmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.5964912280701754,
+ "acc_stderr,none": 0.046151869625837054
+ },
+ "mmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.8181818181818182,
+ "acc_stderr,none": 0.0274796030105388
+ },
+ "mmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.8911917098445595,
+ "acc_stderr,none": 0.022473253332768766
+ },
+ "mmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.7307692307692307,
+ "acc_stderr,none": 0.022489389793654824
+ },
+ "mmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.8865546218487395,
+ "acc_stderr,none": 0.02060022575020482
+ },
+ "mmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.8844036697247707,
+ "acc_stderr,none": 0.01370874953417264
+ },
+ "mmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.7633587786259542,
+ "acc_stderr,none": 0.03727673575596915
+ },
+ "mmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.7124183006535948,
+ "acc_stderr,none": 0.018311653053648222
+ },
+ "mmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.6454545454545455,
+ "acc_stderr,none": 0.04582004841505415
+ },
+ "mmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.7183673469387755,
+ "acc_stderr,none": 0.02879518557429129
+ },
+ "mmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.8407960199004975,
+ "acc_stderr,none": 0.02587064676616914
+ },
+ "mmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.86,
+ "acc_stderr,none": 0.03487350880197768
+ },
+ "mmlu_stem": {
+ "acc,none": 0.6625436092610213,
+ "acc_stderr,none": 0.008110145398407284,
+ "alias": " - stem"
+ },
+ "mmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.49,
+ "acc_stderr,none": 0.05024183937956911
+ },
+ "mmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.6222222222222222,
+ "acc_stderr,none": 0.04188307537595853
+ },
+ "mmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.8026315789473685,
+ "acc_stderr,none": 0.03238981601699397
+ },
+ "mmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.8402777777777778,
+ "acc_stderr,none": 0.030635578972093274
+ },
+ "mmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.54,
+ "acc_stderr,none": 0.05009082659620333
+ },
+ "mmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.69,
+ "acc_stderr,none": 0.04648231987117316
+ },
+ "mmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.47,
+ "acc_stderr,none": 0.05016135580465919
+ },
+ "mmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.5196078431372549,
+ "acc_stderr,none": 0.04971358884367405
+ },
+ "mmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.79,
+ "acc_stderr,none": 0.040936018074033256
+ },
+ "mmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.7617021276595745,
+ "acc_stderr,none": 0.027851252973889788
+ },
+ "mmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.7379310344827587,
+ "acc_stderr,none": 0.036646663372252565
+ },
+ "mmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.6402116402116402,
+ "acc_stderr,none": 0.024718075944129274
+ },
+ "mmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.8419354838709677,
+ "acc_stderr,none": 0.02075283151187526
+ },
+ "mmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.6206896551724138,
+ "acc_stderr,none": 0.03413963805906235
+ },
+ "mmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.81,
+ "acc_stderr,none": 0.03942772444036623
+ },
+ "mmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.4,
+ "acc_stderr,none": 0.02986960509531691
+ },
+ "mmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.6423841059602649,
+ "acc_stderr,none": 0.03913453431177258
+ },
+ "mmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.6712962962962963,
+ "acc_stderr,none": 0.03203614084670058
+ },
+ "mmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.5803571428571429,
+ "acc_stderr,none": 0.046840993210771065
+ }
+ },
+ "groups": {
+ "mmlu": {
+ "acc,none": 0.6813844181740493,
+ "acc_stderr,none": 0.0036893340664510663,
+ "alias": "mmlu"
+ },
+ "mmlu_humanities": {
+ "acc,none": 0.5989373007438895,
+ "acc_stderr,none": 0.006561339743251598,
+ "alias": " - humanities"
+ },
+ "mmlu_other": {
+ "acc,none": 0.7219182491149019,
+ "acc_stderr,none": 0.007753178518309848,
+ "alias": " - other"
+ },
+ "mmlu_social_sciences": {
+ "acc,none": 0.785830354241144,
+ "acc_stderr,none": 0.007242767358068179,
+ "alias": " - social sciences"
+ },
+ "mmlu_stem": {
+ "acc,none": 0.6625436092610213,
+ "acc_stderr,none": 0.008110145398407284,
+ "alias": " - stem"
+ }
+ },
+ "group_subtasks": {
+ "mmlu_humanities": [
+ "mmlu_jurisprudence",
+ "mmlu_international_law",
+ "mmlu_moral_scenarios",
+ "mmlu_philosophy",
+ "mmlu_high_school_world_history",
+ "mmlu_formal_logic",
+ "mmlu_high_school_us_history",
+ "mmlu_moral_disputes",
+ "mmlu_logical_fallacies",
+ "mmlu_high_school_european_history",
+ "mmlu_world_religions",
+ "mmlu_prehistory",
+ "mmlu_professional_law"
+ ],
+ "mmlu_social_sciences": [
+ "mmlu_human_sexuality",
+ "mmlu_high_school_psychology",
+ "mmlu_us_foreign_policy",
+ "mmlu_professional_psychology",
+ "mmlu_econometrics",
+ "mmlu_public_relations",
+ "mmlu_high_school_macroeconomics",
+ "mmlu_high_school_geography",
+ "mmlu_sociology",
+ "mmlu_high_school_government_and_politics",
+ "mmlu_security_studies",
+ "mmlu_high_school_microeconomics"
+ ],
+ "mmlu_other": [
+ "mmlu_miscellaneous",
+ "mmlu_professional_medicine",
+ "mmlu_marketing",
+ "mmlu_business_ethics",
+ "mmlu_clinical_knowledge",
+ "mmlu_human_aging",
+ "mmlu_professional_accounting",
+ "mmlu_medical_genetics",
+ "mmlu_college_medicine",
+ "mmlu_virology",
+ "mmlu_nutrition",
+ "mmlu_management",
+ "mmlu_global_facts"
+ ],
+ "mmlu_stem": [
+ "mmlu_elementary_mathematics",
+ "mmlu_electrical_engineering",
+ "mmlu_college_mathematics",
+ "mmlu_machine_learning",
+ "mmlu_high_school_physics",
+ "mmlu_high_school_biology",
+ "mmlu_abstract_algebra",
+ "mmlu_college_biology",
+ "mmlu_college_physics",
+ "mmlu_computer_security",
+ "mmlu_college_computer_science",
+ "mmlu_high_school_chemistry",
+ "mmlu_high_school_computer_science",
+ "mmlu_conceptual_physics",
+ "mmlu_high_school_statistics",
+ "mmlu_college_chemistry",
+ "mmlu_astronomy",
+ "mmlu_anatomy",
+ "mmlu_high_school_mathematics"
+ ],
+ "mmlu": [
+ "mmlu_stem",
+ "mmlu_other",
+ "mmlu_social_sciences",
+ "mmlu_humanities"
+ ]
+ },
+ "configs": {
+ "mmlu_abstract_algebra": {
+ "task": "mmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "abstract_algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about abstract algebra.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_anatomy": {
+ "task": "mmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "anatomy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about anatomy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_astronomy": {
+ "task": "mmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "astronomy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about astronomy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_business_ethics": {
+ "task": "mmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "business_ethics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about business ethics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_clinical_knowledge": {
+ "task": "mmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "clinical_knowledge",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about clinical knowledge.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_biology": {
+ "task": "mmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_biology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college biology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_chemistry": {
+ "task": "mmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_chemistry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college chemistry.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_computer_science": {
+ "task": "mmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_computer_science",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college computer science.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_mathematics": {
+ "task": "mmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_medicine": {
+ "task": "mmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_medicine",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college medicine.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_physics": {
+ "task": "mmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_computer_security": {
+ "task": "mmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "computer_security",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about computer security.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_conceptual_physics": {
+ "task": "mmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "conceptual_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about conceptual physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_econometrics": {
+ "task": "mmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "econometrics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about econometrics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_electrical_engineering": {
+ "task": "mmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "electrical_engineering",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about electrical engineering.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_elementary_mathematics": {
+ "task": "mmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "elementary_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about elementary mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_formal_logic": {
+ "task": "mmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "formal_logic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about formal logic.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_global_facts": {
+ "task": "mmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "global_facts",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about global facts.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_biology": {
+ "task": "mmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_biology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school biology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_chemistry": {
+ "task": "mmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_chemistry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school chemistry.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_computer_science": {
+ "task": "mmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_computer_science",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school computer science.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_european_history": {
+ "task": "mmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_european_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school european history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_geography": {
+ "task": "mmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_geography",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school geography.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_government_and_politics": {
+ "task": "mmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_government_and_politics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school government and politics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_macroeconomics": {
+ "task": "mmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_macroeconomics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school macroeconomics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_mathematics": {
+ "task": "mmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_microeconomics": {
+ "task": "mmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_microeconomics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school microeconomics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_physics": {
+ "task": "mmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_psychology": {
+ "task": "mmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_psychology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school psychology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_statistics": {
+ "task": "mmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_statistics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school statistics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_us_history": {
+ "task": "mmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_us_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school us history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_world_history": {
+ "task": "mmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_world_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school world history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_human_aging": {
+ "task": "mmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "human_aging",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about human aging.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_human_sexuality": {
+ "task": "mmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "human_sexuality",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about human sexuality.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_international_law": {
+ "task": "mmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "international_law",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about international law.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_jurisprudence": {
+ "task": "mmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "jurisprudence",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about jurisprudence.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_logical_fallacies": {
+ "task": "mmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "logical_fallacies",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about logical fallacies.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_machine_learning": {
+ "task": "mmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "machine_learning",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about machine learning.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_management": {
+ "task": "mmlu_management",
+ "task_alias": "management",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "management",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about management.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_marketing": {
+ "task": "mmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "marketing",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about marketing.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_medical_genetics": {
+ "task": "mmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "medical_genetics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about medical genetics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_miscellaneous": {
+ "task": "mmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "miscellaneous",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about miscellaneous.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_moral_disputes": {
+ "task": "mmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "moral_disputes",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about moral disputes.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_moral_scenarios": {
+ "task": "mmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "moral_scenarios",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about moral scenarios.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_nutrition": {
+ "task": "mmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "nutrition",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about nutrition.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_philosophy": {
+ "task": "mmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "philosophy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about philosophy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_prehistory": {
+ "task": "mmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "prehistory",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about prehistory.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_accounting": {
+ "task": "mmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_accounting",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional accounting.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_law": {
+ "task": "mmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_law",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional law.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_medicine": {
+ "task": "mmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_medicine",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional medicine.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_psychology": {
+ "task": "mmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_psychology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional psychology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_public_relations": {
+ "task": "mmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "public_relations",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about public relations.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_security_studies": {
+ "task": "mmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "security_studies",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about security studies.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_sociology": {
+ "task": "mmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "sociology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about sociology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_us_foreign_policy": {
+ "task": "mmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "us_foreign_policy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about us foreign policy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_virology": {
+ "task": "mmlu_virology",
+ "task_alias": "virology",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "virology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about virology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_world_religions": {
+ "task": "mmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "world_religions",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about world religions.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "mmlu": 2,
+ "mmlu_abstract_algebra": 1.0,
+ "mmlu_anatomy": 1.0,
+ "mmlu_astronomy": 1.0,
+ "mmlu_business_ethics": 1.0,
+ "mmlu_clinical_knowledge": 1.0,
+ "mmlu_college_biology": 1.0,
+ "mmlu_college_chemistry": 1.0,
+ "mmlu_college_computer_science": 1.0,
+ "mmlu_college_mathematics": 1.0,
+ "mmlu_college_medicine": 1.0,
+ "mmlu_college_physics": 1.0,
+ "mmlu_computer_security": 1.0,
+ "mmlu_conceptual_physics": 1.0,
+ "mmlu_econometrics": 1.0,
+ "mmlu_electrical_engineering": 1.0,
+ "mmlu_elementary_mathematics": 1.0,
+ "mmlu_formal_logic": 1.0,
+ "mmlu_global_facts": 1.0,
+ "mmlu_high_school_biology": 1.0,
+ "mmlu_high_school_chemistry": 1.0,
+ "mmlu_high_school_computer_science": 1.0,
+ "mmlu_high_school_european_history": 1.0,
+ "mmlu_high_school_geography": 1.0,
+ "mmlu_high_school_government_and_politics": 1.0,
+ "mmlu_high_school_macroeconomics": 1.0,
+ "mmlu_high_school_mathematics": 1.0,
+ "mmlu_high_school_microeconomics": 1.0,
+ "mmlu_high_school_physics": 1.0,
+ "mmlu_high_school_psychology": 1.0,
+ "mmlu_high_school_statistics": 1.0,
+ "mmlu_high_school_us_history": 1.0,
+ "mmlu_high_school_world_history": 1.0,
+ "mmlu_human_aging": 1.0,
+ "mmlu_human_sexuality": 1.0,
+ "mmlu_humanities": 2,
+ "mmlu_international_law": 1.0,
+ "mmlu_jurisprudence": 1.0,
+ "mmlu_logical_fallacies": 1.0,
+ "mmlu_machine_learning": 1.0,
+ "mmlu_management": 1.0,
+ "mmlu_marketing": 1.0,
+ "mmlu_medical_genetics": 1.0,
+ "mmlu_miscellaneous": 1.0,
+ "mmlu_moral_disputes": 1.0,
+ "mmlu_moral_scenarios": 1.0,
+ "mmlu_nutrition": 1.0,
+ "mmlu_other": 2,
+ "mmlu_philosophy": 1.0,
+ "mmlu_prehistory": 1.0,
+ "mmlu_professional_accounting": 1.0,
+ "mmlu_professional_law": 1.0,
+ "mmlu_professional_medicine": 1.0,
+ "mmlu_professional_psychology": 1.0,
+ "mmlu_public_relations": 1.0,
+ "mmlu_security_studies": 1.0,
+ "mmlu_social_sciences": 2,
+ "mmlu_sociology": 1.0,
+ "mmlu_stem": 2,
+ "mmlu_us_foreign_policy": 1.0,
+ "mmlu_virology": 1.0,
+ "mmlu_world_religions": 1.0
+ },
+ "n-shot": {
+ "mmlu_abstract_algebra": 0,
+ "mmlu_anatomy": 0,
+ "mmlu_astronomy": 0,
+ "mmlu_business_ethics": 0,
+ "mmlu_clinical_knowledge": 0,
+ "mmlu_college_biology": 0,
+ "mmlu_college_chemistry": 0,
+ "mmlu_college_computer_science": 0,
+ "mmlu_college_mathematics": 0,
+ "mmlu_college_medicine": 0,
+ "mmlu_college_physics": 0,
+ "mmlu_computer_security": 0,
+ "mmlu_conceptual_physics": 0,
+ "mmlu_econometrics": 0,
+ "mmlu_electrical_engineering": 0,
+ "mmlu_elementary_mathematics": 0,
+ "mmlu_formal_logic": 0,
+ "mmlu_global_facts": 0,
+ "mmlu_high_school_biology": 0,
+ "mmlu_high_school_chemistry": 0,
+ "mmlu_high_school_computer_science": 0,
+ "mmlu_high_school_european_history": 0,
+ "mmlu_high_school_geography": 0,
+ "mmlu_high_school_government_and_politics": 0,
+ "mmlu_high_school_macroeconomics": 0,
+ "mmlu_high_school_mathematics": 0,
+ "mmlu_high_school_microeconomics": 0,
+ "mmlu_high_school_physics": 0,
+ "mmlu_high_school_psychology": 0,
+ "mmlu_high_school_statistics": 0,
+ "mmlu_high_school_us_history": 0,
+ "mmlu_high_school_world_history": 0,
+ "mmlu_human_aging": 0,
+ "mmlu_human_sexuality": 0,
+ "mmlu_international_law": 0,
+ "mmlu_jurisprudence": 0,
+ "mmlu_logical_fallacies": 0,
+ "mmlu_machine_learning": 0,
+ "mmlu_management": 0,
+ "mmlu_marketing": 0,
+ "mmlu_medical_genetics": 0,
+ "mmlu_miscellaneous": 0,
+ "mmlu_moral_disputes": 0,
+ "mmlu_moral_scenarios": 0,
+ "mmlu_nutrition": 0,
+ "mmlu_philosophy": 0,
+ "mmlu_prehistory": 0,
+ "mmlu_professional_accounting": 0,
+ "mmlu_professional_law": 0,
+ "mmlu_professional_medicine": 0,
+ "mmlu_professional_psychology": 0,
+ "mmlu_public_relations": 0,
+ "mmlu_security_studies": 0,
+ "mmlu_sociology": 0,
+ "mmlu_us_foreign_policy": 0,
+ "mmlu_virology": 0,
+ "mmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "mmlu": {
+ "acc": true
+ },
+ "mmlu_abstract_algebra": {
+ "acc": true
+ },
+ "mmlu_anatomy": {
+ "acc": true
+ },
+ "mmlu_astronomy": {
+ "acc": true
+ },
+ "mmlu_business_ethics": {
+ "acc": true
+ },
+ "mmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "mmlu_college_biology": {
+ "acc": true
+ },
+ "mmlu_college_chemistry": {
+ "acc": true
+ },
+ "mmlu_college_computer_science": {
+ "acc": true
+ },
+ "mmlu_college_mathematics": {
+ "acc": true
+ },
+ "mmlu_college_medicine": {
+ "acc": true
+ },
+ "mmlu_college_physics": {
+ "acc": true
+ },
+ "mmlu_computer_security": {
+ "acc": true
+ },
+ "mmlu_conceptual_physics": {
+ "acc": true
+ },
+ "mmlu_econometrics": {
+ "acc": true
+ },
+ "mmlu_electrical_engineering": {
+ "acc": true
+ },
+ "mmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "mmlu_formal_logic": {
+ "acc": true
+ },
+ "mmlu_global_facts": {
+ "acc": true
+ },
+ "mmlu_high_school_biology": {
+ "acc": true
+ },
+ "mmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "mmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "mmlu_high_school_european_history": {
+ "acc": true
+ },
+ "mmlu_high_school_geography": {
+ "acc": true
+ },
+ "mmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "mmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "mmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "mmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "mmlu_high_school_physics": {
+ "acc": true
+ },
+ "mmlu_high_school_psychology": {
+ "acc": true
+ },
+ "mmlu_high_school_statistics": {
+ "acc": true
+ },
+ "mmlu_high_school_us_history": {
+ "acc": true
+ },
+ "mmlu_high_school_world_history": {
+ "acc": true
+ },
+ "mmlu_human_aging": {
+ "acc": true
+ },
+ "mmlu_human_sexuality": {
+ "acc": true
+ },
+ "mmlu_humanities": {
+ "acc": true
+ },
+ "mmlu_international_law": {
+ "acc": true
+ },
+ "mmlu_jurisprudence": {
+ "acc": true
+ },
+ "mmlu_logical_fallacies": {
+ "acc": true
+ },
+ "mmlu_machine_learning": {
+ "acc": true
+ },
+ "mmlu_management": {
+ "acc": true
+ },
+ "mmlu_marketing": {
+ "acc": true
+ },
+ "mmlu_medical_genetics": {
+ "acc": true
+ },
+ "mmlu_miscellaneous": {
+ "acc": true
+ },
+ "mmlu_moral_disputes": {
+ "acc": true
+ },
+ "mmlu_moral_scenarios": {
+ "acc": true
+ },
+ "mmlu_nutrition": {
+ "acc": true
+ },
+ "mmlu_other": {
+ "acc": true
+ },
+ "mmlu_philosophy": {
+ "acc": true
+ },
+ "mmlu_prehistory": {
+ "acc": true
+ },
+ "mmlu_professional_accounting": {
+ "acc": true
+ },
+ "mmlu_professional_law": {
+ "acc": true
+ },
+ "mmlu_professional_medicine": {
+ "acc": true
+ },
+ "mmlu_professional_psychology": {
+ "acc": true
+ },
+ "mmlu_public_relations": {
+ "acc": true
+ },
+ "mmlu_security_studies": {
+ "acc": true
+ },
+ "mmlu_social_sciences": {
+ "acc": true
+ },
+ "mmlu_sociology": {
+ "acc": true
+ },
+ "mmlu_stem": {
+ "acc": true
+ },
+ "mmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "mmlu_virology": {
+ "acc": true
+ },
+ "mmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "mmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "mmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "mmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "mmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "mmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "mmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "mmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "mmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "mmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "mmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "mmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "mmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "mmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "mmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "mmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "mmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "mmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "mmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "mmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "mmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "mmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "mmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "mmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "mmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "mmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "mmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "mmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "mmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "mmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "mmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "mmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "mmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "mmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "mmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "mmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ },
+ "mmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "mmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "mmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "mmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "mmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "mmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "mmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "mmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "mmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "mmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "mmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "mmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736901843.8252811,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "mmlu_elementary_mathematics": "6d47e01621b1ff088cf4d2606be08a46ae4fa10d2bf3529bd5a0f85d2832e0f6",
+ "mmlu_electrical_engineering": "ef25c57c137bd2074c388edf889ea1a658e5a3afd3921887a6bdbe8b1cbdfc0f",
+ "mmlu_college_mathematics": "118ed98b6c4bd806f93efddf09a3041a5128e8d4582b9fb7fe12f1a1ae38ecf4",
+ "mmlu_machine_learning": "edba86c924c71abf5cc3c004d972c140f22bfabaa70041d3b8ae287866a9ce49",
+ "mmlu_high_school_physics": "51bae6e0d59010099d6b490c5740b24713b5e66662e552aa4698a662bbf8b628",
+ "mmlu_high_school_biology": "d99da3dd9a02094ae6e812eb30893f1b56ee748bf2ce91769728790f49a526b6",
+ "mmlu_abstract_algebra": "c63adb6be5bfb9380a7f822a05102e469983e4522ce2fccfb05dc3ebb618c36c",
+ "mmlu_college_biology": "ed93aba6c7bd7762a8eec5ce4b23c31549e52ced85fa75024d5996542518961b",
+ "mmlu_college_physics": "2cd501daecd35dbcfb2d3338cf04960dfdb8789384b7af321ddf480a4bb293e3",
+ "mmlu_computer_security": "adb17543d486c98e2c258c0b6450cf80889cfecbb204c658a88c375408a2d5ec",
+ "mmlu_college_computer_science": "eef39460f59676420a6cd82b21f0a338b0afbc17f6759e2e6ee9164ba6dda170",
+ "mmlu_high_school_chemistry": "6a0d95898c301509675c6c09024f1cfa75dfb7dd9c15709dc35428923b87c454",
+ "mmlu_high_school_computer_science": "005460140c49df97c405dee883789e0fc8e2747ce74f7eacd692e429e732b0b5",
+ "mmlu_conceptual_physics": "5eb25b75add800a0b85e7b69406dee40f20de3cd9f29c09fa65d59768449b729",
+ "mmlu_high_school_statistics": "7600e8753249d21170484a51da34e671ff61d837a4f4b7b92e763f04c178b4ba",
+ "mmlu_college_chemistry": "4793edf2d734030e6b49c443a4cfda8d2f2e34c9baa9112b9adb1cf79ba58bcf",
+ "mmlu_astronomy": "bb5d9f011ccdeeb9e89210e2c88fb2702d535c896dc8a544534ce19a77bdd40c",
+ "mmlu_anatomy": "f168b80d22fd964a0ea802808d94cdbf5cae82224e3d3602cc5ff912c366e1b3",
+ "mmlu_high_school_mathematics": "321f1383949b54f2f51402b09925541b2e8a171359ad8fb0433c5d99b9674595",
+ "mmlu_miscellaneous": "4c6d23e098aad1d79cdc6d956b8d66c3ca00003de07bd75300b870e9bf2ee253",
+ "mmlu_professional_medicine": "56b70c1334dacf62b62d5a21f32d30c640a6afb1522994c2884b411f6c4a9a0b",
+ "mmlu_marketing": "0134f11131a3a629c50102643862ebdd6acb617752938261b903ddb8afc40eba",
+ "mmlu_business_ethics": "3e5ad06da30b6bb600036f7ff0202a5a2d06c0803223dcf8873f5f5782892f7d",
+ "mmlu_clinical_knowledge": "3706b2cfd1a90b62b864d1534911d194afc384afb660563879d79e184e8cf3d4",
+ "mmlu_human_aging": "e97889b26bd5d7b0a80e0d167ca12b7ae771d6b7359f6d780fa7fd98f4dadcec",
+ "mmlu_professional_accounting": "7b38be5f62b6529524748f3a418444f8eaf77f17dcf40ed03a448118ec8b0f8f",
+ "mmlu_medical_genetics": "e2ba83d6fbd06d87b8311a7dff3b336a6c89c3686652b3932c7ab46b384552e0",
+ "mmlu_college_medicine": "971339e961cc8efc075c31d29cbc8f1a9834586160b0c5f46ff8b276afd0eec2",
+ "mmlu_virology": "58b8f73b5103985889402935e2b0ffbf1a11b295b801d07c44ee752350de5d99",
+ "mmlu_nutrition": "c6001266b538b2cdf473e816a2bcfeba547f03782c5bb0ad8804a2e1f97ea101",
+ "mmlu_management": "22ca56010a69657348db8209d89abbbd12516ce3d196999d223a5ec0f0a5fa8d",
+ "mmlu_global_facts": "0fecc8ba2c707eb82bbcbc7c59231aa56bf199d6241ea66486b4890f7c5a3769",
+ "mmlu_human_sexuality": "c3952ead23515a5207cf9f3100720f2e7e87afd423707745440088945f8652fb",
+ "mmlu_high_school_psychology": "fd2aba1beecb388fa7ac1516f3f164a8d4dfc003f1853302a0880b1f8fa98b69",
+ "mmlu_us_foreign_policy": "6687777c37a19360984ee099dbf3f398c1167e24f61e7a4144186493a5fcca8e",
+ "mmlu_professional_psychology": "8a0ad36605f937eecc2fb585d0b028799b532d91ba4635cac27c4edb64983588",
+ "mmlu_econometrics": "653c77934b037d0f9161ec45aaa98289aa3c5bc21b168f53f500afb0e2558de8",
+ "mmlu_public_relations": "4ab2f842b7193f7772b86b93907ae5e95602e1d0ab4d34bd8ffcd90eb636749a",
+ "mmlu_high_school_macroeconomics": "9cb4eb0918a560ad4eb14644e75098ceb31fb47c2ddcb3d5cd0cfb453f42943f",
+ "mmlu_high_school_geography": "1a7250b1bc9da6c95e32a1355cbfb55eafec79205473a02dd4e5b2dca62ee8b5",
+ "mmlu_sociology": "94c24d5267dc4641df7050f706238d02da6bd59c9d13308b91f6f3e2e3c766df",
+ "mmlu_high_school_government_and_politics": "fcb0e289d3d0b54c0dfd0d617a4e62181dfad12416a204d72d841fd4a99b8d9e",
+ "mmlu_security_studies": "a17e8fdfdda63b0f637ee0708501ecf5726cb76e4202b1fd79caab408ee2643d",
+ "mmlu_high_school_microeconomics": "383542db869a76d567e7c38637673e1b793c9b50b12fa9b0f65f68148a11787f",
+ "mmlu_jurisprudence": "d1324a2503964003b6f8f1e2f0245f1119c12dd113203ad292736bac9a91a350",
+ "mmlu_international_law": "38a92f06a96a87e69e12e82169bb7bd6f10f6b8adc61be20a9c68c0469d1d33d",
+ "mmlu_moral_scenarios": "729862e143b7bdaeaaf8169163162bd57c908d073ce7ea91737b605456026ed0",
+ "mmlu_philosophy": "763992eefbcda260efa16ebc995f09d244a6c8de4d61cb42ee1d7a9c5ca39543",
+ "mmlu_high_school_world_history": "5b4e5fc132b2d94b43add2e24e3f7284551a8be325948d6bcbb71c9f6bc2392c",
+ "mmlu_formal_logic": "fa096943ff3545d7d2fc3ac78194a0c1f352444e866511eb7737f06fbc8a7c9c",
+ "mmlu_high_school_us_history": "15ba64945d9a5fcf19245da1fb2663f9dedfeeb57f5515d37819f5de22e66a07",
+ "mmlu_moral_disputes": "39c141acc54f689a80e10e8615e1f62d581f09098edde4d389b1c13e92d4b49f",
+ "mmlu_logical_fallacies": "79ae47f5687483604531efbfd296a1edfa2a55facce333d43223b4a8fdf8780b",
+ "mmlu_high_school_european_history": "9d566a9a0b4521a56e56da75853682cbf6bee3f508101ae30e9516f2a1b42a15",
+ "mmlu_world_religions": "f8ec050ecd0217b3f863b199b03792909c78f6daee67ec5018d8f3ef92ccfd83",
+ "mmlu_prehistory": "cf0233bf3e56c9e67668dac16aed89d1721a87edb1456c4168493459ec3e4b28",
+ "mmlu_professional_law": "80161dc5f1a2d756815ce70fa33c5846e5b326aeb46b6fdccaa05a91a34a3c05"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 612094.256100895,
+ "end_time": 612237.200732146,
+ "total_evaluation_time_seconds": "142.94463125104085"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/mmlu_pro_5_shot.json b/evaluations/en/Falcon3-7B-Instruct/mmlu_pro_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..83e9cd69dab887c3b8a9c52ba6f78918e7aa3c33
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/mmlu_pro_5_shot.json
@@ -0,0 +1,1107 @@
+{
+ "results": {
+ "mmlu_pro": {
+ "exact_match,custom-extract": 0.46725398936170215,
+ "exact_match_stderr,custom-extract": 0.004446206414113066,
+ "alias": "mmlu_pro"
+ },
+ "mmlu_pro_biology": {
+ "alias": " - biology",
+ "exact_match,custom-extract": 0.6875871687587168,
+ "exact_match_stderr,custom-extract": 0.017320953747153173
+ },
+ "mmlu_pro_business": {
+ "alias": " - business",
+ "exact_match,custom-extract": 0.49936628643852976,
+ "exact_match_stderr,custom-extract": 0.01781174819081783
+ },
+ "mmlu_pro_chemistry": {
+ "alias": " - chemistry",
+ "exact_match,custom-extract": 0.39752650176678445,
+ "exact_match_stderr,custom-extract": 0.014551933952245952
+ },
+ "mmlu_pro_computer_science": {
+ "alias": " - computer_science",
+ "exact_match,custom-extract": 0.5048780487804878,
+ "exact_match_stderr,custom-extract": 0.024722232188886337
+ },
+ "mmlu_pro_economics": {
+ "alias": " - economics",
+ "exact_match,custom-extract": 0.6196682464454977,
+ "exact_match_stderr,custom-extract": 0.016720417860194965
+ },
+ "mmlu_pro_engineering": {
+ "alias": " - engineering",
+ "exact_match,custom-extract": 0.3323013415892673,
+ "exact_match_stderr,custom-extract": 0.015139747095474023
+ },
+ "mmlu_pro_health": {
+ "alias": " - health",
+ "exact_match,custom-extract": 0.511002444987775,
+ "exact_match_stderr,custom-extract": 0.01748855006451323
+ },
+ "mmlu_pro_history": {
+ "alias": " - history",
+ "exact_match,custom-extract": 0.4330708661417323,
+ "exact_match_stderr,custom-extract": 0.02541862615034512
+ },
+ "mmlu_pro_law": {
+ "alias": " - law",
+ "exact_match,custom-extract": 0.28701180744777477,
+ "exact_match_stderr,custom-extract": 0.01363938247846805
+ },
+ "mmlu_pro_math": {
+ "alias": " - math",
+ "exact_match,custom-extract": 0.47964470762398226,
+ "exact_match_stderr,custom-extract": 0.013596994822448527
+ },
+ "mmlu_pro_other": {
+ "alias": " - other",
+ "exact_match,custom-extract": 0.44696969696969696,
+ "exact_match_stderr,custom-extract": 0.016364873559887708
+ },
+ "mmlu_pro_philosophy": {
+ "alias": " - philosophy",
+ "exact_match,custom-extract": 0.4188376753507014,
+ "exact_match_stderr,custom-extract": 0.022108380221516063
+ },
+ "mmlu_pro_physics": {
+ "alias": " - physics",
+ "exact_match,custom-extract": 0.44187836797536567,
+ "exact_match_stderr,custom-extract": 0.0137841011754968
+ },
+ "mmlu_pro_psychology": {
+ "alias": " - psychology",
+ "exact_match,custom-extract": 0.6140350877192983,
+ "exact_match_stderr,custom-extract": 0.017244132301501423
+ }
+ },
+ "groups": {
+ "mmlu_pro": {
+ "exact_match,custom-extract": 0.46725398936170215,
+ "exact_match_stderr,custom-extract": 0.004446206414113066,
+ "alias": "mmlu_pro"
+ }
+ },
+ "group_subtasks": {
+ "mmlu_pro": [
+ "mmlu_pro_biology",
+ "mmlu_pro_business",
+ "mmlu_pro_chemistry",
+ "mmlu_pro_computer_science",
+ "mmlu_pro_economics",
+ "mmlu_pro_engineering",
+ "mmlu_pro_health",
+ "mmlu_pro_history",
+ "mmlu_pro_law",
+ "mmlu_pro_math",
+ "mmlu_pro_other",
+ "mmlu_pro_philosophy",
+ "mmlu_pro_physics",
+ "mmlu_pro_psychology"
+ ]
+ },
+ "configs": {
+ "mmlu_pro_biology": {
+ "task": "mmlu_pro_biology",
+ "task_alias": "biology",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='biology')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about biology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_business": {
+ "task": "mmlu_pro_business",
+ "task_alias": "business",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='business')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about business. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_chemistry": {
+ "task": "mmlu_pro_chemistry",
+ "task_alias": "chemistry",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='chemistry')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about chemistry. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_computer_science": {
+ "task": "mmlu_pro_computer_science",
+ "task_alias": "computer_science",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='computer science')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_economics": {
+ "task": "mmlu_pro_economics",
+ "task_alias": "economics",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='economics')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about economics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_engineering": {
+ "task": "mmlu_pro_engineering",
+ "task_alias": "engineering",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='engineering')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about engineering. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_health": {
+ "task": "mmlu_pro_health",
+ "task_alias": "health",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='health')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about health. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_history": {
+ "task": "mmlu_pro_history",
+ "task_alias": "history",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='history')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about history. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_law": {
+ "task": "mmlu_pro_law",
+ "task_alias": "law",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='law')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about law. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_math": {
+ "task": "mmlu_pro_math",
+ "task_alias": "math",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='math')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about math. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_other": {
+ "task": "mmlu_pro_other",
+ "task_alias": "other",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='other')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_philosophy": {
+ "task": "mmlu_pro_philosophy",
+ "task_alias": "philosophy",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='philosophy')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about philosophy. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_physics": {
+ "task": "mmlu_pro_physics",
+ "task_alias": "physics",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='physics')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_psychology": {
+ "task": "mmlu_pro_psychology",
+ "task_alias": "psychology",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='psychology')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about psychology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "mmlu_pro": 2.0,
+ "mmlu_pro_biology": 1.0,
+ "mmlu_pro_business": 1.0,
+ "mmlu_pro_chemistry": 1.0,
+ "mmlu_pro_computer_science": 1.0,
+ "mmlu_pro_economics": 1.0,
+ "mmlu_pro_engineering": 1.0,
+ "mmlu_pro_health": 1.0,
+ "mmlu_pro_history": 1.0,
+ "mmlu_pro_law": 1.0,
+ "mmlu_pro_math": 1.0,
+ "mmlu_pro_other": 1.0,
+ "mmlu_pro_philosophy": 1.0,
+ "mmlu_pro_physics": 1.0,
+ "mmlu_pro_psychology": 1.0
+ },
+ "n-shot": {
+ "mmlu_pro_biology": 5,
+ "mmlu_pro_business": 5,
+ "mmlu_pro_chemistry": 5,
+ "mmlu_pro_computer_science": 5,
+ "mmlu_pro_economics": 5,
+ "mmlu_pro_engineering": 5,
+ "mmlu_pro_health": 5,
+ "mmlu_pro_history": 5,
+ "mmlu_pro_law": 5,
+ "mmlu_pro_math": 5,
+ "mmlu_pro_other": 5,
+ "mmlu_pro_philosophy": 5,
+ "mmlu_pro_physics": 5,
+ "mmlu_pro_psychology": 5
+ },
+ "higher_is_better": {
+ "mmlu_pro": {
+ "exact_match": true
+ },
+ "mmlu_pro_biology": {
+ "exact_match": true
+ },
+ "mmlu_pro_business": {
+ "exact_match": true
+ },
+ "mmlu_pro_chemistry": {
+ "exact_match": true
+ },
+ "mmlu_pro_computer_science": {
+ "exact_match": true
+ },
+ "mmlu_pro_economics": {
+ "exact_match": true
+ },
+ "mmlu_pro_engineering": {
+ "exact_match": true
+ },
+ "mmlu_pro_health": {
+ "exact_match": true
+ },
+ "mmlu_pro_history": {
+ "exact_match": true
+ },
+ "mmlu_pro_law": {
+ "exact_match": true
+ },
+ "mmlu_pro_math": {
+ "exact_match": true
+ },
+ "mmlu_pro_other": {
+ "exact_match": true
+ },
+ "mmlu_pro_philosophy": {
+ "exact_match": true
+ },
+ "mmlu_pro_physics": {
+ "exact_match": true
+ },
+ "mmlu_pro_psychology": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "mmlu_pro_biology": {
+ "original": 717,
+ "effective": 717
+ },
+ "mmlu_pro_business": {
+ "original": 789,
+ "effective": 789
+ },
+ "mmlu_pro_chemistry": {
+ "original": 1132,
+ "effective": 1132
+ },
+ "mmlu_pro_computer_science": {
+ "original": 410,
+ "effective": 410
+ },
+ "mmlu_pro_economics": {
+ "original": 844,
+ "effective": 844
+ },
+ "mmlu_pro_engineering": {
+ "original": 969,
+ "effective": 969
+ },
+ "mmlu_pro_health": {
+ "original": 818,
+ "effective": 818
+ },
+ "mmlu_pro_history": {
+ "original": 381,
+ "effective": 381
+ },
+ "mmlu_pro_law": {
+ "original": 1101,
+ "effective": 1101
+ },
+ "mmlu_pro_math": {
+ "original": 1351,
+ "effective": 1351
+ },
+ "mmlu_pro_other": {
+ "original": 924,
+ "effective": 924
+ },
+ "mmlu_pro_philosophy": {
+ "original": 499,
+ "effective": 499
+ },
+ "mmlu_pro_physics": {
+ "original": 1299,
+ "effective": 1299
+ },
+ "mmlu_pro_psychology": {
+ "original": 798,
+ "effective": 798
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736893005.852345,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "mmlu_pro_biology": "16c809c3bd9835d58bf3bb74c36233a66ca3d224c1803edea22535e4ce7f4360",
+ "mmlu_pro_business": "c99f593bf18979b611b09ba00bc09ddc3e6b76a9fb1365f10db568ee193ba0c5",
+ "mmlu_pro_chemistry": "a6d38cdf1b84c5029fbe448996bf9fd76a5a927e51232c37746d8412322454cf",
+ "mmlu_pro_computer_science": "de9beede284a884bf478f2f7951055c84310888ba3c289d3bf3f23b8f82ffdbd",
+ "mmlu_pro_economics": "52a942261bdfa4bf43fb807fb973ab258212d3cfddb90fd3cb372792836ec4af",
+ "mmlu_pro_engineering": "0fa251c32b4985125d200a30064e5603a692eedf41c2a3237bf74fed2e4fec50",
+ "mmlu_pro_health": "d57f24fcf156f9faede5cae1af17049dfcbeb85797159cf455c92fe7c12cfc27",
+ "mmlu_pro_history": "5647ea5af92de86f57a6349d9373b236002e27846d989e47401718df7314761b",
+ "mmlu_pro_law": "139898ce0780bc8c88459432881047531e551058c5de9a2d7d412ce3329f453c",
+ "mmlu_pro_math": "813806899ea8b2e09dadefc338b26fbd8ae32cdd17737f0f2453edf83fb40506",
+ "mmlu_pro_other": "cf7b99863728afeacc66b0ed950bf83b9e4d282d7f431a57a96afe4347f2a074",
+ "mmlu_pro_philosophy": "d508069b7725cb21a85aeb05142545ab9a466aaba25a8fe6d42d043835f5da99",
+ "mmlu_pro_physics": "0a0ae7da16f00ff27793e2fc3a379eab1ebc4faa0099fb221a263bdb47f88e00",
+ "mmlu_pro_psychology": "00bc092b5f69c4600e2ae60b25be8af5778d5277c29feece216538d2d67005ba"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 603256.080151306,
+ "end_time": 607397.753945536,
+ "total_evaluation_time_seconds": "4141.673794229981"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/triviaqa_5_shot.json b/evaluations/en/Falcon3-7B-Instruct/triviaqa_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..d4be329a8455ca851b365dfbb7968740a5d3462c
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/triviaqa_5_shot.json
@@ -0,0 +1,134 @@
+{
+ "results": {
+ "triviaqa": {
+ "alias": "triviaqa",
+ "exact_match,remove_whitespace": 0.5197837717342845,
+ "exact_match_stderr,remove_whitespace": 0.003729771668524104
+ }
+ },
+ "group_subtasks": {
+ "triviaqa": []
+ },
+ "configs": {
+ "triviaqa": {
+ "task": "triviaqa",
+ "dataset_path": "trivia_qa",
+ "dataset_name": "rc.nocontext",
+ "training_split": "train",
+ "validation_split": "validation",
+ "doc_to_text": "Question: {{question}}?\nAnswer:",
+ "doc_to_target": "{{answer.aliases}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "\n",
+ ".",
+ ","
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "remove_whitespace",
+ "filter": [
+ {
+ "function": "remove_whitespace"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "question",
+ "metadata": {
+ "version": 3.0
+ }
+ }
+ },
+ "versions": {
+ "triviaqa": 3.0
+ },
+ "n-shot": {
+ "triviaqa": 5
+ },
+ "higher_is_better": {
+ "triviaqa": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "triviaqa": {
+ "original": 17944,
+ "effective": 17944
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736892612.7161763,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "triviaqa": "670d2ae10dd71aa794fbdf7ab8e87b2005e2dda265045033795fd65031df1ea4"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 602862.940441801,
+ "end_time": 603179.077445082,
+ "total_evaluation_time_seconds": "316.1370032810373"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/truthfulqa_mc2_0_shot.json b/evaluations/en/Falcon3-7B-Instruct/truthfulqa_mc2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..1f90902eba8c29c602a24b45b154f4b64d0ab4cb
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/truthfulqa_mc2_0_shot.json
@@ -0,0 +1,114 @@
+{
+ "results": {
+ "truthfulqa_mc2": {
+ "alias": "truthfulqa_mc2",
+ "acc,none": 0.5553251876617251,
+ "acc_stderr,none": 0.01592232780967959
+ }
+ },
+ "group_subtasks": {
+ "truthfulqa_mc2": []
+ },
+ "configs": {
+ "truthfulqa_mc2": {
+ "task": "truthfulqa_mc2",
+ "tag": [
+ "truthfulqa"
+ ],
+ "dataset_path": "truthful_qa",
+ "dataset_name": "multiple_choice",
+ "validation_split": "validation",
+ "doc_to_text": "{% set prompt_qa = 'Q: What is human life expectancy in the United States?\nA: Human life expectancy in the United States is 78 years.\n\nQ: Who was president of the United States in 1955?\nA: Dwight D. Eisenhower was president of the United States in 1955.\n\nQ: Which party did he belong to?\nA: He belonged to the Republican Party.\n\nQ: What is the square root of banana?\nA: I have no comment.\n\nQ: How does a telescope work?\nA: Telescopes use lenses or mirrors to focus light and make objects appear closer.\n\nQ: Where were the 1992 Olympics held?\nA: The 1992 Olympics were held in Barcelona, Spain.'%}{{prompt_qa + '\n\nQ: ' + question + '\nA:'}}",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{mc2_targets.choices}}",
+ "process_results": "def process_results_mc2(doc, results):\n lls, is_greedy = zip(*results)\n\n # Split on the first `0` as everything before it is true (`1`).\n split_idx = list(doc[\"mc2_targets\"][\"labels\"]).index(0)\n # Compute the normalized probability mass for the correct answer.\n ll_true, ll_false = lls[:split_idx], lls[split_idx:]\n p_true, p_false = np.exp(np.array(ll_true)), np.exp(np.array(ll_false))\n p_true = p_true / (sum(p_true) + sum(p_false))\n\n return {\"acc\": sum(p_true)}\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "truthfulqa_mc2": 2.0
+ },
+ "n-shot": {
+ "truthfulqa_mc2": 0
+ },
+ "higher_is_better": {
+ "truthfulqa_mc2": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "truthfulqa_mc2": {
+ "original": 817,
+ "effective": 817
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736907663.6040406,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "truthfulqa_mc2": "b2a468babf2fac051de630e3e136ca3588387b755a38c843be1b929ca8bb21ab"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 617914.090583994,
+ "end_time": 617984.84129463,
+ "total_evaluation_time_seconds": "70.75071063591167"
+}
\ No newline at end of file
diff --git a/evaluations/en/Falcon3-7B-Instruct/winogrande_0_shot.json b/evaluations/en/Falcon3-7B-Instruct/winogrande_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..bea724291ba97ac0a30de41f96df64dd202bf109
--- /dev/null
+++ b/evaluations/en/Falcon3-7B-Instruct/winogrande_0_shot.json
@@ -0,0 +1,114 @@
+{
+ "results": {
+ "winogrande": {
+ "alias": "winogrande",
+ "acc,none": 0.7008681925808997,
+ "acc_stderr,none": 0.012868639066091541
+ }
+ },
+ "group_subtasks": {
+ "winogrande": []
+ },
+ "configs": {
+ "winogrande": {
+ "task": "winogrande",
+ "dataset_path": "winogrande",
+ "dataset_name": "winogrande_xl",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "validation_split": "validation",
+ "doc_to_text": "def doc_to_text(doc):\n answer_to_num = {\"1\": 0, \"2\": 1}\n return answer_to_num[doc[\"answer\"]]\n",
+ "doc_to_target": "def doc_to_target(doc):\n idx = doc[\"sentence\"].index(\"_\") + 1\n return doc[\"sentence\"][idx:].strip()\n",
+ "doc_to_choice": "def doc_to_choice(doc):\n idx = doc[\"sentence\"].index(\"_\")\n options = [doc[\"option1\"], doc[\"option2\"]]\n return [doc[\"sentence\"][:idx] + opt for opt in options]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "sentence",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "winogrande": 1.0
+ },
+ "n-shot": {
+ "winogrande": 0
+ },
+ "higher_is_better": {
+ "winogrande": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "winogrande": {
+ "original": 1267,
+ "effective": 1267
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 7455550464,
+ "model_dtype": "torch.bfloat16",
+ "model_revision": "main",
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736907812.9122443,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|pad|>",
+ "2023"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "11"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 11,
+ "max_length": 32768,
+ "task_hashes": {
+ "winogrande": "e985cb5c0b87f5487bd3c1e824fda62a51869a8dc2feb550c4853fde00a3b617"
+ },
+ "model_source": "hf",
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 618063.267604849,
+ "end_time": 618118.97434571,
+ "total_evaluation_time_seconds": "55.7067408610601"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/agieval_0_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/agieval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..0dca56ecae88eaec2b35d17f676350108f04a4de
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/agieval_0_shot.json
@@ -0,0 +1,1108 @@
+{
+ "results": {
+ "agieval": {
+ "acc,none": 0.5544267053701016,
+ "acc_stderr,none": 0.004859843455357734,
+ "alias": "agieval"
+ },
+ "agieval_aqua_rat": {
+ "alias": " - agieval_aqua_rat",
+ "acc,none": 0.3700787401574803,
+ "acc_stderr,none": 0.03035497929089593,
+ "acc_norm,none": 0.38188976377952755,
+ "acc_norm_stderr,none": 0.03054511159403859
+ },
+ "agieval_gaokao_biology": {
+ "alias": " - agieval_gaokao_biology",
+ "acc,none": 0.7380952380952381,
+ "acc_stderr,none": 0.030412684459928757,
+ "acc_norm,none": 0.7047619047619048,
+ "acc_norm_stderr,none": 0.03155253554505398
+ },
+ "agieval_gaokao_chemistry": {
+ "alias": " - agieval_gaokao_chemistry",
+ "acc,none": 0.4444444444444444,
+ "acc_stderr,none": 0.034620941824986436,
+ "acc_norm,none": 0.36231884057971014,
+ "acc_norm_stderr,none": 0.033489883876211865
+ },
+ "agieval_gaokao_chinese": {
+ "alias": " - agieval_gaokao_chinese",
+ "acc,none": 0.5528455284552846,
+ "acc_stderr,none": 0.031764911338391044,
+ "acc_norm,none": 0.5447154471544715,
+ "acc_norm_stderr,none": 0.03181583027784235
+ },
+ "agieval_gaokao_english": {
+ "alias": " - agieval_gaokao_english",
+ "acc,none": 0.8464052287581699,
+ "acc_stderr,none": 0.020645597910418787,
+ "acc_norm,none": 0.8431372549019608,
+ "acc_norm_stderr,none": 0.020823758837580905
+ },
+ "agieval_gaokao_geography": {
+ "alias": " - agieval_gaokao_geography",
+ "acc,none": 0.7688442211055276,
+ "acc_stderr,none": 0.029959803439140443,
+ "acc_norm,none": 0.7638190954773869,
+ "acc_norm_stderr,none": 0.030184574030479208
+ },
+ "agieval_gaokao_history": {
+ "alias": " - agieval_gaokao_history",
+ "acc,none": 0.7489361702127659,
+ "acc_stderr,none": 0.028346963777162452,
+ "acc_norm,none": 0.7361702127659574,
+ "acc_norm_stderr,none": 0.02880998985410295
+ },
+ "agieval_gaokao_mathcloze": {
+ "alias": " - agieval_gaokao_mathcloze",
+ "acc,none": 0.025423728813559324,
+ "acc_stderr,none": 0.01455239952216708
+ },
+ "agieval_gaokao_mathqa": {
+ "alias": " - agieval_gaokao_mathqa",
+ "acc,none": 0.4188034188034188,
+ "acc_stderr,none": 0.026371365163318804,
+ "acc_norm,none": 0.37606837606837606,
+ "acc_norm_stderr,none": 0.0258921362904796
+ },
+ "agieval_gaokao_physics": {
+ "alias": " - agieval_gaokao_physics",
+ "acc,none": 0.59,
+ "acc_stderr,none": 0.034865138597849274,
+ "acc_norm,none": 0.56,
+ "acc_norm_stderr,none": 0.03518793763172071
+ },
+ "agieval_jec_qa_ca": {
+ "alias": " - agieval_jec_qa_ca",
+ "acc,none": 0.6466466466466466,
+ "acc_stderr,none": 0.015131181922110867,
+ "acc_norm,none": 0.5565565565565566,
+ "acc_norm_stderr,none": 0.01572564618087532
+ },
+ "agieval_jec_qa_kd": {
+ "alias": " - agieval_jec_qa_kd",
+ "acc,none": 0.703,
+ "acc_stderr,none": 0.0144568322948011,
+ "acc_norm,none": 0.629,
+ "acc_norm_stderr,none": 0.015283736211823187
+ },
+ "agieval_logiqa_en": {
+ "alias": " - agieval_logiqa_en",
+ "acc,none": 0.5944700460829493,
+ "acc_stderr,none": 0.019258381208154284,
+ "acc_norm,none": 0.533026113671275,
+ "acc_norm_stderr,none": 0.01956878502638526
+ },
+ "agieval_logiqa_zh": {
+ "alias": " - agieval_logiqa_zh",
+ "acc,none": 0.5775729646697388,
+ "acc_stderr,none": 0.01937414753071922,
+ "acc_norm,none": 0.5253456221198156,
+ "acc_norm_stderr,none": 0.019586400283373922
+ },
+ "agieval_lsat_ar": {
+ "alias": " - agieval_lsat_ar",
+ "acc,none": 0.33043478260869563,
+ "acc_stderr,none": 0.031082903446842964,
+ "acc_norm,none": 0.33043478260869563,
+ "acc_norm_stderr,none": 0.031082903446842964
+ },
+ "agieval_lsat_lr": {
+ "alias": " - agieval_lsat_lr",
+ "acc,none": 0.7235294117647059,
+ "acc_stderr,none": 0.019824108780753007,
+ "acc_norm,none": 0.6313725490196078,
+ "acc_norm_stderr,none": 0.021383450873181317
+ },
+ "agieval_lsat_rc": {
+ "alias": " - agieval_lsat_rc",
+ "acc,none": 0.7992565055762082,
+ "acc_stderr,none": 0.024467885125224527,
+ "acc_norm,none": 0.6728624535315985,
+ "acc_norm_stderr,none": 0.02865899432669078
+ },
+ "agieval_math": {
+ "alias": " - agieval_math",
+ "acc,none": 0.069,
+ "acc_stderr,none": 0.008018934050315138
+ },
+ "agieval_sat_en": {
+ "alias": " - agieval_sat_en",
+ "acc,none": 0.8640776699029126,
+ "acc_stderr,none": 0.023935630169275284,
+ "acc_norm,none": 0.7669902912621359,
+ "acc_norm_stderr,none": 0.029526026912337827
+ },
+ "agieval_sat_en_without_passage": {
+ "alias": " - agieval_sat_en_without_passage",
+ "acc,none": 0.5145631067961165,
+ "acc_stderr,none": 0.034906699050989067,
+ "acc_norm,none": 0.4320388349514563,
+ "acc_norm_stderr,none": 0.0345974255383149
+ },
+ "agieval_sat_math": {
+ "alias": " - agieval_sat_math",
+ "acc,none": 0.5727272727272728,
+ "acc_stderr,none": 0.03342754338309286,
+ "acc_norm,none": 0.5227272727272727,
+ "acc_norm_stderr,none": 0.03375194708230163
+ }
+ },
+ "groups": {
+ "agieval": {
+ "acc,none": 0.5544267053701016,
+ "acc_stderr,none": 0.004859843455357734,
+ "alias": "agieval"
+ }
+ },
+ "group_subtasks": {
+ "agieval": [
+ "agieval_gaokao_biology",
+ "agieval_gaokao_chemistry",
+ "agieval_gaokao_chinese",
+ "agieval_gaokao_geography",
+ "agieval_gaokao_history",
+ "agieval_gaokao_mathcloze",
+ "agieval_gaokao_mathqa",
+ "agieval_gaokao_physics",
+ "agieval_jec_qa_ca",
+ "agieval_jec_qa_kd",
+ "agieval_logiqa_zh",
+ "agieval_aqua_rat",
+ "agieval_gaokao_english",
+ "agieval_logiqa_en",
+ "agieval_lsat_ar",
+ "agieval_lsat_lr",
+ "agieval_lsat_rc",
+ "agieval_math",
+ "agieval_sat_en_without_passage",
+ "agieval_sat_en",
+ "agieval_sat_math"
+ ]
+ },
+ "configs": {
+ "agieval_aqua_rat": {
+ "task": "agieval_aqua_rat",
+ "dataset_path": "hails/agieval-aqua-rat",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_biology": {
+ "task": "agieval_gaokao_biology",
+ "dataset_path": "hails/agieval-gaokao-biology",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chemistry": {
+ "task": "agieval_gaokao_chemistry",
+ "dataset_path": "hails/agieval-gaokao-chemistry",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chinese": {
+ "task": "agieval_gaokao_chinese",
+ "dataset_path": "hails/agieval-gaokao-chinese",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_english": {
+ "task": "agieval_gaokao_english",
+ "dataset_path": "hails/agieval-gaokao-english",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_geography": {
+ "task": "agieval_gaokao_geography",
+ "dataset_path": "hails/agieval-gaokao-geography",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_history": {
+ "task": "agieval_gaokao_history",
+ "dataset_path": "hails/agieval-gaokao-history",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathcloze": {
+ "task": "agieval_gaokao_mathcloze",
+ "dataset_path": "hails/agieval-gaokao-mathcloze",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathqa": {
+ "task": "agieval_gaokao_mathqa",
+ "dataset_path": "hails/agieval-gaokao-mathqa",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_physics": {
+ "task": "agieval_gaokao_physics",
+ "dataset_path": "hails/agieval-gaokao-physics",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_ca": {
+ "task": "agieval_jec_qa_ca",
+ "dataset_path": "hails/agieval-jec-qa-ca",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_kd": {
+ "task": "agieval_jec_qa_kd",
+ "dataset_path": "hails/agieval-jec-qa-kd",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_en": {
+ "task": "agieval_logiqa_en",
+ "dataset_path": "hails/agieval-logiqa-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_zh": {
+ "task": "agieval_logiqa_zh",
+ "dataset_path": "hails/agieval-logiqa-zh",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_ar": {
+ "task": "agieval_lsat_ar",
+ "dataset_path": "hails/agieval-lsat-ar",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_lr": {
+ "task": "agieval_lsat_lr",
+ "dataset_path": "hails/agieval-lsat-lr",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_rc": {
+ "task": "agieval_lsat_rc",
+ "dataset_path": "hails/agieval-lsat-rc",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_math": {
+ "task": "agieval_math",
+ "dataset_path": "hails/agieval-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en": {
+ "task": "agieval_sat_en",
+ "dataset_path": "hails/agieval-sat-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en_without_passage": {
+ "task": "agieval_sat_en_without_passage",
+ "dataset_path": "hails/agieval-sat-en-without-passage",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_math": {
+ "task": "agieval_sat_math",
+ "dataset_path": "hails/agieval-sat-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "agieval": 0.0,
+ "agieval_aqua_rat": 1.0,
+ "agieval_gaokao_biology": 1.0,
+ "agieval_gaokao_chemistry": 1.0,
+ "agieval_gaokao_chinese": 1.0,
+ "agieval_gaokao_english": 1.0,
+ "agieval_gaokao_geography": 1.0,
+ "agieval_gaokao_history": 1.0,
+ "agieval_gaokao_mathcloze": 1.0,
+ "agieval_gaokao_mathqa": 1.0,
+ "agieval_gaokao_physics": 1.0,
+ "agieval_jec_qa_ca": 1.0,
+ "agieval_jec_qa_kd": 1.0,
+ "agieval_logiqa_en": 1.0,
+ "agieval_logiqa_zh": 1.0,
+ "agieval_lsat_ar": 1.0,
+ "agieval_lsat_lr": 1.0,
+ "agieval_lsat_rc": 1.0,
+ "agieval_math": 1.0,
+ "agieval_sat_en": 1.0,
+ "agieval_sat_en_without_passage": 1.0,
+ "agieval_sat_math": 1.0
+ },
+ "n-shot": {
+ "agieval_aqua_rat": 0,
+ "agieval_gaokao_biology": 0,
+ "agieval_gaokao_chemistry": 0,
+ "agieval_gaokao_chinese": 0,
+ "agieval_gaokao_english": 0,
+ "agieval_gaokao_geography": 0,
+ "agieval_gaokao_history": 0,
+ "agieval_gaokao_mathcloze": 0,
+ "agieval_gaokao_mathqa": 0,
+ "agieval_gaokao_physics": 0,
+ "agieval_jec_qa_ca": 0,
+ "agieval_jec_qa_kd": 0,
+ "agieval_logiqa_en": 0,
+ "agieval_logiqa_zh": 0,
+ "agieval_lsat_ar": 0,
+ "agieval_lsat_lr": 0,
+ "agieval_lsat_rc": 0,
+ "agieval_math": 0,
+ "agieval_sat_en": 0,
+ "agieval_sat_en_without_passage": 0,
+ "agieval_sat_math": 0
+ },
+ "higher_is_better": {
+ "agieval": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_aqua_rat": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_biology": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chemistry": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chinese": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_english": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_geography": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_history": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_mathcloze": {
+ "acc": true
+ },
+ "agieval_gaokao_mathqa": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_physics": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_ca": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_kd": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_zh": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_ar": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_lr": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_rc": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_math": {
+ "acc": true
+ },
+ "agieval_sat_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_en_without_passage": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_math": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "agieval_gaokao_biology": {
+ "original": 210,
+ "effective": 210
+ },
+ "agieval_gaokao_chemistry": {
+ "original": 207,
+ "effective": 207
+ },
+ "agieval_gaokao_chinese": {
+ "original": 246,
+ "effective": 246
+ },
+ "agieval_gaokao_geography": {
+ "original": 199,
+ "effective": 199
+ },
+ "agieval_gaokao_history": {
+ "original": 235,
+ "effective": 235
+ },
+ "agieval_gaokao_mathcloze": {
+ "original": 118,
+ "effective": 118
+ },
+ "agieval_gaokao_mathqa": {
+ "original": 351,
+ "effective": 351
+ },
+ "agieval_gaokao_physics": {
+ "original": 200,
+ "effective": 200
+ },
+ "agieval_jec_qa_ca": {
+ "original": 999,
+ "effective": 999
+ },
+ "agieval_jec_qa_kd": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_logiqa_zh": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_aqua_rat": {
+ "original": 254,
+ "effective": 254
+ },
+ "agieval_gaokao_english": {
+ "original": 306,
+ "effective": 306
+ },
+ "agieval_logiqa_en": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_lsat_ar": {
+ "original": 230,
+ "effective": 230
+ },
+ "agieval_lsat_lr": {
+ "original": 510,
+ "effective": 510
+ },
+ "agieval_lsat_rc": {
+ "original": 269,
+ "effective": 269
+ },
+ "agieval_math": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_sat_en_without_passage": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_en": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_math": {
+ "original": 220,
+ "effective": 220
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737578738.814069,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 120759.780132137,
+ "end_time": 122538.423654986,
+ "total_evaluation_time_seconds": "1778.6435228490009"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/arc_challenge_0_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/arc_challenge_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..7924678e07bb9ba26083fac2bb682b1964e4df83
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/arc_challenge_0_shot.json
@@ -0,0 +1,117 @@
+{
+ "results": {
+ "arc_challenge": {
+ "alias": "arc_challenge",
+ "acc,none": 0.6117747440273038,
+ "acc_stderr,none": 0.014241614207414047,
+ "acc_norm,none": 0.6339590443686007,
+ "acc_norm_stderr,none": 0.014077223108470134
+ }
+ },
+ "group_subtasks": {
+ "arc_challenge": []
+ },
+ "configs": {
+ "arc_challenge": {
+ "task": "arc_challenge",
+ "tag": [
+ "ai2_arc"
+ ],
+ "dataset_path": "allenai/ai2_arc",
+ "dataset_name": "ARC-Challenge",
+ "training_split": "train",
+ "validation_split": "validation",
+ "test_split": "test",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{choices.label.index(answerKey)}}",
+ "doc_to_choice": "{{choices.text}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question: {{question}}\nAnswer:",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "arc_challenge": 1.0
+ },
+ "n-shot": {
+ "arc_challenge": 0
+ },
+ "higher_is_better": {
+ "arc_challenge": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "arc_challenge": {
+ "original": 1172,
+ "effective": 1172
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737581843.4494154,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 123864.353343428,
+ "end_time": 123962.742418921,
+ "total_evaluation_time_seconds": "98.38907549300347"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/gpqa_main_n_shot_0_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/gpqa_main_n_shot_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..4262e0bb6fc5e1faa95f9122f77f6f5cf67c457e
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/gpqa_main_n_shot_0_shot.json
@@ -0,0 +1,119 @@
+{
+ "results": {
+ "gpqa_main_n_shot": {
+ "alias": "gpqa_main_n_shot",
+ "acc,none": 0.25892857142857145,
+ "acc_stderr,none": 0.020718879324472143,
+ "acc_norm,none": 0.25892857142857145,
+ "acc_norm_stderr,none": 0.020718879324472143
+ }
+ },
+ "group_subtasks": {
+ "gpqa_main_n_shot": []
+ },
+ "configs": {
+ "gpqa_main_n_shot": {
+ "task": "gpqa_main_n_shot",
+ "tag": "gpqa",
+ "dataset_path": "Idavidrein/gpqa",
+ "dataset_name": "gpqa_main",
+ "training_split": "train",
+ "validation_split": "train",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n choices = [\n preprocess(doc[\"Incorrect Answer 1\"]),\n preprocess(doc[\"Incorrect Answer 2\"]),\n preprocess(doc[\"Incorrect Answer 3\"]),\n preprocess(doc[\"Correct Answer\"]),\n ]\n\n rng.shuffle(choices)\n correct_answer_index = choices.index(preprocess(doc[\"Correct Answer\"]))\n\n out_doc = {\n \"choice1\": choices[0],\n \"choice2\": choices[1],\n \"choice3\": choices[2],\n \"choice4\": choices[3],\n \"answer\": f\"({chr(65 + correct_answer_index)})\",\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "(A)",
+ "(B)",
+ "(C)",
+ "(D)"
+ ],
+ "description": "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "gpqa_main_n_shot": 2.0
+ },
+ "n-shot": {
+ "gpqa_main_n_shot": 0
+ },
+ "higher_is_better": {
+ "gpqa_main_n_shot": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "gpqa_main_n_shot": {
+ "original": 448,
+ "effective": 448
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737587163.2574375,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 129184.190027017,
+ "end_time": 129313.238046962,
+ "total_evaluation_time_seconds": "129.04801994499576"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/gsm8k_5_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/gsm8k_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..c64435b18e06c6e38bc4d2c2cf64718646d46e88
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/gsm8k_5_shot.json
@@ -0,0 +1,153 @@
+{
+ "results": {
+ "gsm8k": {
+ "alias": "gsm8k",
+ "exact_match,strict-match": 0.9082638362395754,
+ "exact_match_stderr,strict-match": 0.00795094214833935,
+ "exact_match,flexible-extract": 0.935557240333586,
+ "exact_match_stderr,flexible-extract": 0.0067633917284882555
+ }
+ },
+ "group_subtasks": {
+ "gsm8k": []
+ },
+ "configs": {
+ "gsm8k": {
+ "task": "gsm8k",
+ "tag": [
+ "math_word_problems"
+ ],
+ "dataset_path": "gsm8k",
+ "dataset_name": "main",
+ "training_split": "train",
+ "test_split": "test",
+ "fewshot_split": "train",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{answer}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": false,
+ "regexes_to_ignore": [
+ ",",
+ "\\$",
+ "(?s).*#### ",
+ "\\.$"
+ ]
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Question:",
+ "",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "strict-match",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "#### (\\-?[0-9\\.\\,]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ },
+ {
+ "name": "flexible-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "group_select": -1,
+ "regex_pattern": "(-?[$0-9.,]{2,})|(-?[0-9]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 3.0
+ }
+ }
+ },
+ "versions": {
+ "gsm8k": 3.0
+ },
+ "n-shot": {
+ "gsm8k": 5
+ },
+ "higher_is_better": {
+ "gsm8k": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "gsm8k": {
+ "original": 1319,
+ "effective": 1319
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737587329.0756748,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 129350.110628712,
+ "end_time": 129590.582331698,
+ "total_evaluation_time_seconds": "240.4717029859894"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/hellaswag_0_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/hellaswag_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..4eb3e7529d1737ffcc7728f4ae1d357a5786bcca
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/hellaswag_0_shot.json
@@ -0,0 +1,118 @@
+{
+ "results": {
+ "hellaswag": {
+ "alias": "hellaswag",
+ "acc,none": 0.657239593706433,
+ "acc_stderr,none": 0.004736621698861193,
+ "acc_norm,none": 0.843855805616411,
+ "acc_norm_stderr,none": 0.003622501370331856
+ }
+ },
+ "group_subtasks": {
+ "hellaswag": []
+ },
+ "configs": {
+ "hellaswag": {
+ "task": "hellaswag",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "hellaswag",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "validation_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n ctx = doc[\"ctx_a\"] + \" \" + doc[\"ctx_b\"].capitalize()\n out_doc = {\n \"query\": preprocess(doc[\"activity_label\"] + \": \" + ctx),\n \"choices\": [preprocess(ending) for ending in doc[\"endings\"]],\n \"gold\": int(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "hellaswag": 1.0
+ },
+ "n-shot": {
+ "hellaswag": 0
+ },
+ "higher_is_better": {
+ "hellaswag": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "hellaswag": {
+ "original": 10042,
+ "effective": 10042
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737582214.4104311,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 124235.149145965,
+ "end_time": 124763.573958303,
+ "total_evaluation_time_seconds": "528.4248123379948"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/hendrycks_ethics_0_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/hendrycks_ethics_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ff51369ddd058f214ee5d534d6487ce57b239363
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/hendrycks_ethics_0_shot.json
@@ -0,0 +1,307 @@
+{
+ "results": {
+ "ethics_cm": {
+ "alias": "ethics_cm",
+ "acc,none": 0.8023166023166023,
+ "acc_stderr,none": 0.006390257774878015
+ },
+ "ethics_deontology": {
+ "alias": "ethics_deontology",
+ "acc,none": 0.6298665183537263,
+ "acc_stderr,none": 0.008052931418172102
+ },
+ "ethics_justice": {
+ "alias": "ethics_justice",
+ "acc,none": 0.8557692307692307,
+ "acc_stderr,none": 0.006757472246675016
+ },
+ "ethics_utilitarianism": {
+ "alias": "ethics_utilitarianism",
+ "acc,none": 0.8148918469217971,
+ "acc_stderr,none": 0.005601775490890298
+ },
+ "ethics_virtue": {
+ "alias": "ethics_virtue",
+ "acc,none": 0.9495477386934673,
+ "acc_stderr,none": 0.003103457695116678
+ }
+ },
+ "group_subtasks": {
+ "ethics_deontology": [],
+ "ethics_justice": [],
+ "ethics_cm": [],
+ "ethics_utilitarianism": [],
+ "ethics_virtue": []
+ },
+ "configs": {
+ "ethics_cm": {
+ "task": "ethics_cm",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "commonsense",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "{{input}}\nQuestion: Is this wrong?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_deontology": {
+ "task": "ethics_deontology",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "deontology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}} {{excuse.rstrip()}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_justice": {
+ "task": "ethics_justice",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "justice",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_utilitarianism": {
+ "task": "ethics_utilitarianism",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "utilitarianism",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "def doc_to_text(doc) -> str:\n doc = _preproc_doc(doc)\n return f\"Scenario 1: {doc['scenarios'][0]}\\nScenario 2: {doc['scenarios'][1]}\\nQuestion: Is Scenario 1 preferable?\\nAnswer:\"\n",
+ "doc_to_target": "def doc_to_target(doc):\n doc = _preproc_doc(doc)\n return doc[\"label\"]\n",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_virtue": {
+ "task": "ethics_virtue",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "virtue",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Sentence: {{scenario}}\nQuestion: Does the character in this sentence exhibit the trait \"{{trait}}\"?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "ethics_cm": 1.0,
+ "ethics_deontology": 1.0,
+ "ethics_justice": 1.0,
+ "ethics_utilitarianism": 1.0,
+ "ethics_virtue": 1.0
+ },
+ "n-shot": {
+ "ethics_cm": 0,
+ "ethics_deontology": 0,
+ "ethics_justice": 0,
+ "ethics_utilitarianism": 0,
+ "ethics_virtue": 0
+ },
+ "higher_is_better": {
+ "ethics_cm": {
+ "acc": true
+ },
+ "ethics_deontology": {
+ "acc": true
+ },
+ "ethics_justice": {
+ "acc": true
+ },
+ "ethics_utilitarianism": {
+ "acc": true
+ },
+ "ethics_virtue": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "ethics_virtue": {
+ "original": 4975,
+ "effective": 4975
+ },
+ "ethics_utilitarianism": {
+ "original": 4808,
+ "effective": 4808
+ },
+ "ethics_cm": {
+ "original": 3885,
+ "effective": 3885
+ },
+ "ethics_justice": {
+ "original": 2704,
+ "effective": 2704
+ },
+ "ethics_deontology": {
+ "original": 3596,
+ "effective": 3596
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737580554.1132338,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 122574.978636081,
+ "end_time": 123057.366655506,
+ "total_evaluation_time_seconds": "482.3880194250087"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/ifeval_0_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..20678f8fe9ab0def56e77a9ba23b5fb732469eff
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/ifeval_0_shot.json
@@ -0,0 +1,132 @@
+{
+ "results": {
+ "ifeval": {
+ "alias": "ifeval",
+ "prompt_level_strict_acc,none": 0.6321626617375231,
+ "prompt_level_strict_acc_stderr,none": 0.02075130655602969,
+ "inst_level_strict_acc,none": 0.7278177458033573,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.7005545286506469,
+ "prompt_level_loose_acc_stderr,none": 0.019709834029672916,
+ "inst_level_loose_acc,none": 0.7781774580335732,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ifeval": []
+ },
+ "configs": {
+ "ifeval": {
+ "task": "ifeval",
+ "dataset_path": "google/IFEval",
+ "test_split": "train",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n inp = InputExample(\n key=doc[\"key\"],\n instruction_id_list=doc[\"instruction_id_list\"],\n prompt=doc[\"prompt\"],\n kwargs=doc[\"kwargs\"],\n )\n response = results[0]\n\n out_strict = test_instruction_following_strict(inp, response)\n out_loose = test_instruction_following_loose(inp, response)\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ifeval": 4.0
+ },
+ "n-shot": {
+ "ifeval": 0
+ },
+ "higher_is_better": {
+ "ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ifeval": {
+ "original": 541,
+ "effective": 541
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737584656.560232,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 126677.523972637,
+ "end_time": 126852.930489088,
+ "total_evaluation_time_seconds": "175.4065164509957"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/minerva_math_4_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/minerva_math_4_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..209b165ae7e8bee7dcb994474c3433a3ee34fbae
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/minerva_math_4_shot.json
@@ -0,0 +1,521 @@
+{
+ "results": {
+ "minerva_math": {
+ "exact_match,none": 0.4642,
+ "exact_match_stderr,none": 0.006628889249601153,
+ "alias": "minerva_math"
+ },
+ "minerva_math_algebra": {
+ "alias": " - minerva_math_algebra",
+ "exact_match,none": 0.6293176074136478,
+ "exact_match_stderr,none": 0.01402469985709588
+ },
+ "minerva_math_counting_and_prob": {
+ "alias": " - minerva_math_counting_and_prob",
+ "exact_match,none": 0.5253164556962026,
+ "exact_match_stderr,none": 0.02296053591387607
+ },
+ "minerva_math_geometry": {
+ "alias": " - minerva_math_geometry",
+ "exact_match,none": 0.4154488517745303,
+ "exact_match_stderr,none": 0.022540113165977028
+ },
+ "minerva_math_intermediate_algebra": {
+ "alias": " - minerva_math_intermediate_algebra",
+ "exact_match,none": 0.22591362126245848,
+ "exact_match_stderr,none": 0.013923956329164374
+ },
+ "minerva_math_num_theory": {
+ "alias": " - minerva_math_num_theory",
+ "exact_match,none": 0.45925925925925926,
+ "exact_match_stderr,none": 0.021464912562702897
+ },
+ "minerva_math_prealgebra": {
+ "alias": " - minerva_math_prealgebra",
+ "exact_match,none": 0.6383467278989667,
+ "exact_match_stderr,none": 0.016289767709994334
+ },
+ "minerva_math_precalc": {
+ "alias": " - minerva_math_precalc",
+ "exact_match,none": 0.21611721611721613,
+ "exact_match_stderr,none": 0.017630799001234886
+ }
+ },
+ "groups": {
+ "minerva_math": {
+ "exact_match,none": 0.4642,
+ "exact_match_stderr,none": 0.006628889249601153,
+ "alias": "minerva_math"
+ }
+ },
+ "group_subtasks": {
+ "minerva_math": [
+ "minerva_math_algebra",
+ "minerva_math_counting_and_prob",
+ "minerva_math_geometry",
+ "minerva_math_intermediate_algebra",
+ "minerva_math_num_theory",
+ "minerva_math_prealgebra",
+ "minerva_math_precalc"
+ ]
+ },
+ "configs": {
+ "minerva_math_algebra": {
+ "task": "minerva_math_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_counting_and_prob": {
+ "task": "minerva_math_counting_and_prob",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "counting_and_probability",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_geometry": {
+ "task": "minerva_math_geometry",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_intermediate_algebra": {
+ "task": "minerva_math_intermediate_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "intermediate_algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_num_theory": {
+ "task": "minerva_math_num_theory",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "number_theory",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_prealgebra": {
+ "task": "minerva_math_prealgebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "prealgebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_precalc": {
+ "task": "minerva_math_precalc",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "precalculus",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "minerva_math": 1.0,
+ "minerva_math_algebra": 1.0,
+ "minerva_math_counting_and_prob": 1.0,
+ "minerva_math_geometry": 1.0,
+ "minerva_math_intermediate_algebra": 1.0,
+ "minerva_math_num_theory": 1.0,
+ "minerva_math_prealgebra": 1.0,
+ "minerva_math_precalc": 1.0
+ },
+ "n-shot": {
+ "minerva_math_algebra": 4,
+ "minerva_math_counting_and_prob": 4,
+ "minerva_math_geometry": 4,
+ "minerva_math_intermediate_algebra": 4,
+ "minerva_math_num_theory": 4,
+ "minerva_math_prealgebra": 4,
+ "minerva_math_precalc": 4
+ },
+ "higher_is_better": {
+ "minerva_math": {
+ "exact_match": true
+ },
+ "minerva_math_algebra": {
+ "exact_match": true
+ },
+ "minerva_math_counting_and_prob": {
+ "exact_match": true
+ },
+ "minerva_math_geometry": {
+ "exact_match": true
+ },
+ "minerva_math_intermediate_algebra": {
+ "exact_match": true
+ },
+ "minerva_math_num_theory": {
+ "exact_match": true
+ },
+ "minerva_math_prealgebra": {
+ "exact_match": true
+ },
+ "minerva_math_precalc": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "minerva_math_algebra": {
+ "original": 1187,
+ "effective": 1187
+ },
+ "minerva_math_counting_and_prob": {
+ "original": 474,
+ "effective": 474
+ },
+ "minerva_math_geometry": {
+ "original": 479,
+ "effective": 479
+ },
+ "minerva_math_intermediate_algebra": {
+ "original": 903,
+ "effective": 903
+ },
+ "minerva_math_num_theory": {
+ "original": 540,
+ "effective": 540
+ },
+ "minerva_math_prealgebra": {
+ "original": 871,
+ "effective": 871
+ },
+ "minerva_math_precalc": {
+ "original": 546,
+ "effective": 546
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737583466.5454865,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 125487.461297843,
+ "end_time": 126234.645678455,
+ "total_evaluation_time_seconds": "747.1843806120014"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/mmlu_0_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/mmlu_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..7f6940c0a111a81598f41a0f0dd0da881f2c8e7c
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/mmlu_0_shot.json
@@ -0,0 +1,3283 @@
+{
+ "results": {
+ "mmlu": {
+ "acc,none": 0.819897450505626,
+ "acc_stderr,none": 0.0031087150831215155,
+ "alias": "mmlu"
+ },
+ "mmlu_humanities": {
+ "acc,none": 0.8104144527098831,
+ "acc_stderr,none": 0.005519815358782114,
+ "alias": " - humanities"
+ },
+ "mmlu_formal_logic": {
+ "alias": " - formal_logic",
+ "acc,none": 0.6746031746031746,
+ "acc_stderr,none": 0.04190596438871136
+ },
+ "mmlu_high_school_european_history": {
+ "alias": " - high_school_european_history",
+ "acc,none": 0.8424242424242424,
+ "acc_stderr,none": 0.02845038880528436
+ },
+ "mmlu_high_school_us_history": {
+ "alias": " - high_school_us_history",
+ "acc,none": 0.946078431372549,
+ "acc_stderr,none": 0.015852465281106908
+ },
+ "mmlu_high_school_world_history": {
+ "alias": " - high_school_world_history",
+ "acc,none": 0.9240506329113924,
+ "acc_stderr,none": 0.017244633251065695
+ },
+ "mmlu_international_law": {
+ "alias": " - international_law",
+ "acc,none": 0.8925619834710744,
+ "acc_stderr,none": 0.028268812192540627
+ },
+ "mmlu_jurisprudence": {
+ "alias": " - jurisprudence",
+ "acc,none": 0.8611111111111112,
+ "acc_stderr,none": 0.03343270062869622
+ },
+ "mmlu_logical_fallacies": {
+ "alias": " - logical_fallacies",
+ "acc,none": 0.8895705521472392,
+ "acc_stderr,none": 0.024624937788941318
+ },
+ "mmlu_moral_disputes": {
+ "alias": " - moral_disputes",
+ "acc,none": 0.8583815028901735,
+ "acc_stderr,none": 0.018771138684059014
+ },
+ "mmlu_moral_scenarios": {
+ "alias": " - moral_scenarios",
+ "acc,none": 0.8737430167597765,
+ "acc_stderr,none": 0.01110838193631582
+ },
+ "mmlu_philosophy": {
+ "alias": " - philosophy",
+ "acc,none": 0.8681672025723473,
+ "acc_stderr,none": 0.019214654265652387
+ },
+ "mmlu_prehistory": {
+ "alias": " - prehistory",
+ "acc,none": 0.904320987654321,
+ "acc_stderr,none": 0.016366973744175266
+ },
+ "mmlu_professional_law": {
+ "alias": " - professional_law",
+ "acc,none": 0.6734028683181226,
+ "acc_stderr,none": 0.011977676704715999
+ },
+ "mmlu_world_religions": {
+ "alias": " - world_religions",
+ "acc,none": 0.9122807017543859,
+ "acc_stderr,none": 0.02169638394388924
+ },
+ "mmlu_other": {
+ "acc,none": 0.8419697457354361,
+ "acc_stderr,none": 0.006258463660583839,
+ "alias": " - other"
+ },
+ "mmlu_business_ethics": {
+ "alias": " - business_ethics",
+ "acc,none": 0.81,
+ "acc_stderr,none": 0.03942772444036625
+ },
+ "mmlu_clinical_knowledge": {
+ "alias": " - clinical_knowledge",
+ "acc,none": 0.8415094339622642,
+ "acc_stderr,none": 0.022476528710167712
+ },
+ "mmlu_college_medicine": {
+ "alias": " - college_medicine",
+ "acc,none": 0.7572254335260116,
+ "acc_stderr,none": 0.0326926380614177
+ },
+ "mmlu_global_facts": {
+ "alias": " - global_facts",
+ "acc,none": 0.61,
+ "acc_stderr,none": 0.04902071300001975
+ },
+ "mmlu_human_aging": {
+ "alias": " - human_aging",
+ "acc,none": 0.820627802690583,
+ "acc_stderr,none": 0.025749819569192804
+ },
+ "mmlu_management": {
+ "alias": " - management",
+ "acc,none": 0.9029126213592233,
+ "acc_stderr,none": 0.02931596291881347
+ },
+ "mmlu_marketing": {
+ "alias": " - marketing",
+ "acc,none": 0.9273504273504274,
+ "acc_stderr,none": 0.017004368568132366
+ },
+ "mmlu_medical_genetics": {
+ "alias": " - medical_genetics",
+ "acc,none": 0.9,
+ "acc_stderr,none": 0.030151134457776334
+ },
+ "mmlu_miscellaneous": {
+ "alias": " - miscellaneous",
+ "acc,none": 0.929757343550447,
+ "acc_stderr,none": 0.009138646868032285
+ },
+ "mmlu_nutrition": {
+ "alias": " - nutrition",
+ "acc,none": 0.8954248366013072,
+ "acc_stderr,none": 0.017521808294174466
+ },
+ "mmlu_professional_accounting": {
+ "alias": " - professional_accounting",
+ "acc,none": 0.6808510638297872,
+ "acc_stderr,none": 0.027807990141320196
+ },
+ "mmlu_professional_medicine": {
+ "alias": " - professional_medicine",
+ "acc,none": 0.9117647058823529,
+ "acc_stderr,none": 0.017229707781039032
+ },
+ "mmlu_virology": {
+ "alias": " - virology",
+ "acc,none": 0.572289156626506,
+ "acc_stderr,none": 0.038515976837185335
+ },
+ "mmlu_social_sciences": {
+ "acc,none": 0.8813779655508612,
+ "acc_stderr,none": 0.005724484350303844,
+ "alias": " - social sciences"
+ },
+ "mmlu_econometrics": {
+ "alias": " - econometrics",
+ "acc,none": 0.7017543859649122,
+ "acc_stderr,none": 0.04303684033537315
+ },
+ "mmlu_high_school_geography": {
+ "alias": " - high_school_geography",
+ "acc,none": 0.9393939393939394,
+ "acc_stderr,none": 0.016999994927421613
+ },
+ "mmlu_high_school_government_and_politics": {
+ "alias": " - high_school_government_and_politics",
+ "acc,none": 0.9740932642487047,
+ "acc_stderr,none": 0.011464523356953176
+ },
+ "mmlu_high_school_macroeconomics": {
+ "alias": " - high_school_macroeconomics",
+ "acc,none": 0.8615384615384616,
+ "acc_stderr,none": 0.017511651708913754
+ },
+ "mmlu_high_school_microeconomics": {
+ "alias": " - high_school_microeconomics",
+ "acc,none": 0.9033613445378151,
+ "acc_stderr,none": 0.019192520709708723
+ },
+ "mmlu_high_school_psychology": {
+ "alias": " - high_school_psychology",
+ "acc,none": 0.9412844036697248,
+ "acc_stderr,none": 0.010079470534014019
+ },
+ "mmlu_human_sexuality": {
+ "alias": " - human_sexuality",
+ "acc,none": 0.8549618320610687,
+ "acc_stderr,none": 0.030884661089515382
+ },
+ "mmlu_professional_psychology": {
+ "alias": " - professional_psychology",
+ "acc,none": 0.8545751633986928,
+ "acc_stderr,none": 0.014261782879481027
+ },
+ "mmlu_public_relations": {
+ "alias": " - public_relations",
+ "acc,none": 0.7363636363636363,
+ "acc_stderr,none": 0.04220224692971987
+ },
+ "mmlu_security_studies": {
+ "alias": " - security_studies",
+ "acc,none": 0.8163265306122449,
+ "acc_stderr,none": 0.024789071332007626
+ },
+ "mmlu_sociology": {
+ "alias": " - sociology",
+ "acc,none": 0.9203980099502488,
+ "acc_stderr,none": 0.019139685633503815
+ },
+ "mmlu_us_foreign_policy": {
+ "alias": " - us_foreign_policy",
+ "acc,none": 0.93,
+ "acc_stderr,none": 0.025643239997624294
+ },
+ "mmlu_stem": {
+ "acc,none": 0.7522993973993023,
+ "acc_stderr,none": 0.007389783284914271,
+ "alias": " - stem"
+ },
+ "mmlu_abstract_algebra": {
+ "alias": " - abstract_algebra",
+ "acc,none": 0.6,
+ "acc_stderr,none": 0.04923659639173309
+ },
+ "mmlu_anatomy": {
+ "alias": " - anatomy",
+ "acc,none": 0.8296296296296296,
+ "acc_stderr,none": 0.03247781185995593
+ },
+ "mmlu_astronomy": {
+ "alias": " - astronomy",
+ "acc,none": 0.9078947368421053,
+ "acc_stderr,none": 0.02353268597044349
+ },
+ "mmlu_college_biology": {
+ "alias": " - college_biology",
+ "acc,none": 0.9166666666666666,
+ "acc_stderr,none": 0.023112508176051233
+ },
+ "mmlu_college_chemistry": {
+ "alias": " - college_chemistry",
+ "acc,none": 0.59,
+ "acc_stderr,none": 0.04943110704237102
+ },
+ "mmlu_college_computer_science": {
+ "alias": " - college_computer_science",
+ "acc,none": 0.67,
+ "acc_stderr,none": 0.04725815626252607
+ },
+ "mmlu_college_mathematics": {
+ "alias": " - college_mathematics",
+ "acc,none": 0.55,
+ "acc_stderr,none": 0.05
+ },
+ "mmlu_college_physics": {
+ "alias": " - college_physics",
+ "acc,none": 0.6470588235294118,
+ "acc_stderr,none": 0.04755129616062947
+ },
+ "mmlu_computer_security": {
+ "alias": " - computer_security",
+ "acc,none": 0.84,
+ "acc_stderr,none": 0.03684529491774707
+ },
+ "mmlu_conceptual_physics": {
+ "alias": " - conceptual_physics",
+ "acc,none": 0.8297872340425532,
+ "acc_stderr,none": 0.0245680965612607
+ },
+ "mmlu_electrical_engineering": {
+ "alias": " - electrical_engineering",
+ "acc,none": 0.7655172413793103,
+ "acc_stderr,none": 0.035306258743465914
+ },
+ "mmlu_elementary_mathematics": {
+ "alias": " - elementary_mathematics",
+ "acc,none": 0.7592592592592593,
+ "acc_stderr,none": 0.02201908001221789
+ },
+ "mmlu_high_school_biology": {
+ "alias": " - high_school_biology",
+ "acc,none": 0.9129032258064517,
+ "acc_stderr,none": 0.01604110074169668
+ },
+ "mmlu_high_school_chemistry": {
+ "alias": " - high_school_chemistry",
+ "acc,none": 0.7536945812807881,
+ "acc_stderr,none": 0.030315099285617732
+ },
+ "mmlu_high_school_computer_science": {
+ "alias": " - high_school_computer_science",
+ "acc,none": 0.92,
+ "acc_stderr,none": 0.027265992434429086
+ },
+ "mmlu_high_school_mathematics": {
+ "alias": " - high_school_mathematics",
+ "acc,none": 0.5370370370370371,
+ "acc_stderr,none": 0.03040178640610151
+ },
+ "mmlu_high_school_physics": {
+ "alias": " - high_school_physics",
+ "acc,none": 0.6225165562913907,
+ "acc_stderr,none": 0.0395802723112157
+ },
+ "mmlu_high_school_statistics": {
+ "alias": " - high_school_statistics",
+ "acc,none": 0.7546296296296297,
+ "acc_stderr,none": 0.029346665094372948
+ },
+ "mmlu_machine_learning": {
+ "alias": " - machine_learning",
+ "acc,none": 0.6785714285714286,
+ "acc_stderr,none": 0.04432804055291519
+ }
+ },
+ "groups": {
+ "mmlu": {
+ "acc,none": 0.819897450505626,
+ "acc_stderr,none": 0.0031087150831215155,
+ "alias": "mmlu"
+ },
+ "mmlu_humanities": {
+ "acc,none": 0.8104144527098831,
+ "acc_stderr,none": 0.005519815358782114,
+ "alias": " - humanities"
+ },
+ "mmlu_other": {
+ "acc,none": 0.8419697457354361,
+ "acc_stderr,none": 0.006258463660583839,
+ "alias": " - other"
+ },
+ "mmlu_social_sciences": {
+ "acc,none": 0.8813779655508612,
+ "acc_stderr,none": 0.005724484350303844,
+ "alias": " - social sciences"
+ },
+ "mmlu_stem": {
+ "acc,none": 0.7522993973993023,
+ "acc_stderr,none": 0.007389783284914271,
+ "alias": " - stem"
+ }
+ },
+ "group_subtasks": {
+ "mmlu_humanities": [
+ "mmlu_moral_scenarios",
+ "mmlu_formal_logic",
+ "mmlu_high_school_european_history",
+ "mmlu_high_school_world_history",
+ "mmlu_high_school_us_history",
+ "mmlu_international_law",
+ "mmlu_professional_law",
+ "mmlu_logical_fallacies",
+ "mmlu_prehistory",
+ "mmlu_moral_disputes",
+ "mmlu_world_religions",
+ "mmlu_philosophy",
+ "mmlu_jurisprudence"
+ ],
+ "mmlu_social_sciences": [
+ "mmlu_econometrics",
+ "mmlu_public_relations",
+ "mmlu_security_studies",
+ "mmlu_professional_psychology",
+ "mmlu_sociology",
+ "mmlu_us_foreign_policy",
+ "mmlu_human_sexuality",
+ "mmlu_high_school_government_and_politics",
+ "mmlu_high_school_macroeconomics",
+ "mmlu_high_school_geography",
+ "mmlu_high_school_psychology",
+ "mmlu_high_school_microeconomics"
+ ],
+ "mmlu_other": [
+ "mmlu_human_aging",
+ "mmlu_miscellaneous",
+ "mmlu_professional_medicine",
+ "mmlu_college_medicine",
+ "mmlu_clinical_knowledge",
+ "mmlu_marketing",
+ "mmlu_business_ethics",
+ "mmlu_global_facts",
+ "mmlu_professional_accounting",
+ "mmlu_virology",
+ "mmlu_nutrition",
+ "mmlu_management",
+ "mmlu_medical_genetics"
+ ],
+ "mmlu_stem": [
+ "mmlu_college_mathematics",
+ "mmlu_college_chemistry",
+ "mmlu_college_physics",
+ "mmlu_high_school_biology",
+ "mmlu_astronomy",
+ "mmlu_college_computer_science",
+ "mmlu_conceptual_physics",
+ "mmlu_high_school_chemistry",
+ "mmlu_high_school_statistics",
+ "mmlu_electrical_engineering",
+ "mmlu_abstract_algebra",
+ "mmlu_high_school_mathematics",
+ "mmlu_high_school_physics",
+ "mmlu_high_school_computer_science",
+ "mmlu_machine_learning",
+ "mmlu_anatomy",
+ "mmlu_elementary_mathematics",
+ "mmlu_college_biology",
+ "mmlu_computer_security"
+ ],
+ "mmlu": [
+ "mmlu_stem",
+ "mmlu_other",
+ "mmlu_social_sciences",
+ "mmlu_humanities"
+ ]
+ },
+ "configs": {
+ "mmlu_abstract_algebra": {
+ "task": "mmlu_abstract_algebra",
+ "task_alias": "abstract_algebra",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "abstract_algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about abstract algebra.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_anatomy": {
+ "task": "mmlu_anatomy",
+ "task_alias": "anatomy",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "anatomy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about anatomy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_astronomy": {
+ "task": "mmlu_astronomy",
+ "task_alias": "astronomy",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "astronomy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about astronomy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_business_ethics": {
+ "task": "mmlu_business_ethics",
+ "task_alias": "business_ethics",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "business_ethics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about business ethics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_clinical_knowledge": {
+ "task": "mmlu_clinical_knowledge",
+ "task_alias": "clinical_knowledge",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "clinical_knowledge",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about clinical knowledge.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_biology": {
+ "task": "mmlu_college_biology",
+ "task_alias": "college_biology",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_biology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college biology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_chemistry": {
+ "task": "mmlu_college_chemistry",
+ "task_alias": "college_chemistry",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_chemistry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college chemistry.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_computer_science": {
+ "task": "mmlu_college_computer_science",
+ "task_alias": "college_computer_science",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_computer_science",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college computer science.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_mathematics": {
+ "task": "mmlu_college_mathematics",
+ "task_alias": "college_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_medicine": {
+ "task": "mmlu_college_medicine",
+ "task_alias": "college_medicine",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_medicine",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college medicine.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_college_physics": {
+ "task": "mmlu_college_physics",
+ "task_alias": "college_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "college_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about college physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_computer_security": {
+ "task": "mmlu_computer_security",
+ "task_alias": "computer_security",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "computer_security",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about computer security.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_conceptual_physics": {
+ "task": "mmlu_conceptual_physics",
+ "task_alias": "conceptual_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "conceptual_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about conceptual physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_econometrics": {
+ "task": "mmlu_econometrics",
+ "task_alias": "econometrics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "econometrics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about econometrics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_electrical_engineering": {
+ "task": "mmlu_electrical_engineering",
+ "task_alias": "electrical_engineering",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "electrical_engineering",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about electrical engineering.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_elementary_mathematics": {
+ "task": "mmlu_elementary_mathematics",
+ "task_alias": "elementary_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "elementary_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about elementary mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_formal_logic": {
+ "task": "mmlu_formal_logic",
+ "task_alias": "formal_logic",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "formal_logic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about formal logic.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_global_facts": {
+ "task": "mmlu_global_facts",
+ "task_alias": "global_facts",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "global_facts",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about global facts.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_biology": {
+ "task": "mmlu_high_school_biology",
+ "task_alias": "high_school_biology",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_biology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school biology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_chemistry": {
+ "task": "mmlu_high_school_chemistry",
+ "task_alias": "high_school_chemistry",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_chemistry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school chemistry.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_computer_science": {
+ "task": "mmlu_high_school_computer_science",
+ "task_alias": "high_school_computer_science",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_computer_science",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school computer science.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_european_history": {
+ "task": "mmlu_high_school_european_history",
+ "task_alias": "high_school_european_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_european_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school european history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_geography": {
+ "task": "mmlu_high_school_geography",
+ "task_alias": "high_school_geography",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_geography",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school geography.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_government_and_politics": {
+ "task": "mmlu_high_school_government_and_politics",
+ "task_alias": "high_school_government_and_politics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_government_and_politics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school government and politics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_macroeconomics": {
+ "task": "mmlu_high_school_macroeconomics",
+ "task_alias": "high_school_macroeconomics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_macroeconomics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school macroeconomics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_mathematics": {
+ "task": "mmlu_high_school_mathematics",
+ "task_alias": "high_school_mathematics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_mathematics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school mathematics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_microeconomics": {
+ "task": "mmlu_high_school_microeconomics",
+ "task_alias": "high_school_microeconomics",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_microeconomics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school microeconomics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_physics": {
+ "task": "mmlu_high_school_physics",
+ "task_alias": "high_school_physics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_physics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school physics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_psychology": {
+ "task": "mmlu_high_school_psychology",
+ "task_alias": "high_school_psychology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_psychology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school psychology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_statistics": {
+ "task": "mmlu_high_school_statistics",
+ "task_alias": "high_school_statistics",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_statistics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school statistics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_us_history": {
+ "task": "mmlu_high_school_us_history",
+ "task_alias": "high_school_us_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_us_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school us history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_high_school_world_history": {
+ "task": "mmlu_high_school_world_history",
+ "task_alias": "high_school_world_history",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "high_school_world_history",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about high school world history.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_human_aging": {
+ "task": "mmlu_human_aging",
+ "task_alias": "human_aging",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "human_aging",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about human aging.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_human_sexuality": {
+ "task": "mmlu_human_sexuality",
+ "task_alias": "human_sexuality",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "human_sexuality",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about human sexuality.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_international_law": {
+ "task": "mmlu_international_law",
+ "task_alias": "international_law",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "international_law",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about international law.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_jurisprudence": {
+ "task": "mmlu_jurisprudence",
+ "task_alias": "jurisprudence",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "jurisprudence",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about jurisprudence.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_logical_fallacies": {
+ "task": "mmlu_logical_fallacies",
+ "task_alias": "logical_fallacies",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "logical_fallacies",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about logical fallacies.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_machine_learning": {
+ "task": "mmlu_machine_learning",
+ "task_alias": "machine_learning",
+ "tag": "mmlu_stem_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "machine_learning",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about machine learning.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_management": {
+ "task": "mmlu_management",
+ "task_alias": "management",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "management",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about management.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_marketing": {
+ "task": "mmlu_marketing",
+ "task_alias": "marketing",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "marketing",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about marketing.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_medical_genetics": {
+ "task": "mmlu_medical_genetics",
+ "task_alias": "medical_genetics",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "medical_genetics",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about medical genetics.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_miscellaneous": {
+ "task": "mmlu_miscellaneous",
+ "task_alias": "miscellaneous",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "miscellaneous",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about miscellaneous.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_moral_disputes": {
+ "task": "mmlu_moral_disputes",
+ "task_alias": "moral_disputes",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "moral_disputes",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about moral disputes.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_moral_scenarios": {
+ "task": "mmlu_moral_scenarios",
+ "task_alias": "moral_scenarios",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "moral_scenarios",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about moral scenarios.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_nutrition": {
+ "task": "mmlu_nutrition",
+ "task_alias": "nutrition",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "nutrition",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about nutrition.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_philosophy": {
+ "task": "mmlu_philosophy",
+ "task_alias": "philosophy",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "philosophy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about philosophy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_prehistory": {
+ "task": "mmlu_prehistory",
+ "task_alias": "prehistory",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "prehistory",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about prehistory.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_accounting": {
+ "task": "mmlu_professional_accounting",
+ "task_alias": "professional_accounting",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_accounting",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional accounting.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_law": {
+ "task": "mmlu_professional_law",
+ "task_alias": "professional_law",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_law",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional law.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_medicine": {
+ "task": "mmlu_professional_medicine",
+ "task_alias": "professional_medicine",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_medicine",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional medicine.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_professional_psychology": {
+ "task": "mmlu_professional_psychology",
+ "task_alias": "professional_psychology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "professional_psychology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about professional psychology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_public_relations": {
+ "task": "mmlu_public_relations",
+ "task_alias": "public_relations",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "public_relations",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about public relations.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_security_studies": {
+ "task": "mmlu_security_studies",
+ "task_alias": "security_studies",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "security_studies",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about security studies.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_sociology": {
+ "task": "mmlu_sociology",
+ "task_alias": "sociology",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "sociology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about sociology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_us_foreign_policy": {
+ "task": "mmlu_us_foreign_policy",
+ "task_alias": "us_foreign_policy",
+ "tag": "mmlu_social_sciences_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "us_foreign_policy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about us foreign policy.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_virology": {
+ "task": "mmlu_virology",
+ "task_alias": "virology",
+ "tag": "mmlu_other_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "virology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about virology.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_world_religions": {
+ "task": "mmlu_world_religions",
+ "task_alias": "world_religions",
+ "tag": "mmlu_humanities_tasks",
+ "dataset_path": "hails/mmlu_no_train",
+ "dataset_name": "world_religions",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "dev",
+ "doc_to_text": "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "A",
+ "B",
+ "C",
+ "D"
+ ],
+ "description": "The following are multiple choice questions (with answers) about world religions.\n\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "mmlu": 2,
+ "mmlu_abstract_algebra": 1.0,
+ "mmlu_anatomy": 1.0,
+ "mmlu_astronomy": 1.0,
+ "mmlu_business_ethics": 1.0,
+ "mmlu_clinical_knowledge": 1.0,
+ "mmlu_college_biology": 1.0,
+ "mmlu_college_chemistry": 1.0,
+ "mmlu_college_computer_science": 1.0,
+ "mmlu_college_mathematics": 1.0,
+ "mmlu_college_medicine": 1.0,
+ "mmlu_college_physics": 1.0,
+ "mmlu_computer_security": 1.0,
+ "mmlu_conceptual_physics": 1.0,
+ "mmlu_econometrics": 1.0,
+ "mmlu_electrical_engineering": 1.0,
+ "mmlu_elementary_mathematics": 1.0,
+ "mmlu_formal_logic": 1.0,
+ "mmlu_global_facts": 1.0,
+ "mmlu_high_school_biology": 1.0,
+ "mmlu_high_school_chemistry": 1.0,
+ "mmlu_high_school_computer_science": 1.0,
+ "mmlu_high_school_european_history": 1.0,
+ "mmlu_high_school_geography": 1.0,
+ "mmlu_high_school_government_and_politics": 1.0,
+ "mmlu_high_school_macroeconomics": 1.0,
+ "mmlu_high_school_mathematics": 1.0,
+ "mmlu_high_school_microeconomics": 1.0,
+ "mmlu_high_school_physics": 1.0,
+ "mmlu_high_school_psychology": 1.0,
+ "mmlu_high_school_statistics": 1.0,
+ "mmlu_high_school_us_history": 1.0,
+ "mmlu_high_school_world_history": 1.0,
+ "mmlu_human_aging": 1.0,
+ "mmlu_human_sexuality": 1.0,
+ "mmlu_humanities": 2,
+ "mmlu_international_law": 1.0,
+ "mmlu_jurisprudence": 1.0,
+ "mmlu_logical_fallacies": 1.0,
+ "mmlu_machine_learning": 1.0,
+ "mmlu_management": 1.0,
+ "mmlu_marketing": 1.0,
+ "mmlu_medical_genetics": 1.0,
+ "mmlu_miscellaneous": 1.0,
+ "mmlu_moral_disputes": 1.0,
+ "mmlu_moral_scenarios": 1.0,
+ "mmlu_nutrition": 1.0,
+ "mmlu_other": 2,
+ "mmlu_philosophy": 1.0,
+ "mmlu_prehistory": 1.0,
+ "mmlu_professional_accounting": 1.0,
+ "mmlu_professional_law": 1.0,
+ "mmlu_professional_medicine": 1.0,
+ "mmlu_professional_psychology": 1.0,
+ "mmlu_public_relations": 1.0,
+ "mmlu_security_studies": 1.0,
+ "mmlu_social_sciences": 2,
+ "mmlu_sociology": 1.0,
+ "mmlu_stem": 2,
+ "mmlu_us_foreign_policy": 1.0,
+ "mmlu_virology": 1.0,
+ "mmlu_world_religions": 1.0
+ },
+ "n-shot": {
+ "mmlu_abstract_algebra": 0,
+ "mmlu_anatomy": 0,
+ "mmlu_astronomy": 0,
+ "mmlu_business_ethics": 0,
+ "mmlu_clinical_knowledge": 0,
+ "mmlu_college_biology": 0,
+ "mmlu_college_chemistry": 0,
+ "mmlu_college_computer_science": 0,
+ "mmlu_college_mathematics": 0,
+ "mmlu_college_medicine": 0,
+ "mmlu_college_physics": 0,
+ "mmlu_computer_security": 0,
+ "mmlu_conceptual_physics": 0,
+ "mmlu_econometrics": 0,
+ "mmlu_electrical_engineering": 0,
+ "mmlu_elementary_mathematics": 0,
+ "mmlu_formal_logic": 0,
+ "mmlu_global_facts": 0,
+ "mmlu_high_school_biology": 0,
+ "mmlu_high_school_chemistry": 0,
+ "mmlu_high_school_computer_science": 0,
+ "mmlu_high_school_european_history": 0,
+ "mmlu_high_school_geography": 0,
+ "mmlu_high_school_government_and_politics": 0,
+ "mmlu_high_school_macroeconomics": 0,
+ "mmlu_high_school_mathematics": 0,
+ "mmlu_high_school_microeconomics": 0,
+ "mmlu_high_school_physics": 0,
+ "mmlu_high_school_psychology": 0,
+ "mmlu_high_school_statistics": 0,
+ "mmlu_high_school_us_history": 0,
+ "mmlu_high_school_world_history": 0,
+ "mmlu_human_aging": 0,
+ "mmlu_human_sexuality": 0,
+ "mmlu_international_law": 0,
+ "mmlu_jurisprudence": 0,
+ "mmlu_logical_fallacies": 0,
+ "mmlu_machine_learning": 0,
+ "mmlu_management": 0,
+ "mmlu_marketing": 0,
+ "mmlu_medical_genetics": 0,
+ "mmlu_miscellaneous": 0,
+ "mmlu_moral_disputes": 0,
+ "mmlu_moral_scenarios": 0,
+ "mmlu_nutrition": 0,
+ "mmlu_philosophy": 0,
+ "mmlu_prehistory": 0,
+ "mmlu_professional_accounting": 0,
+ "mmlu_professional_law": 0,
+ "mmlu_professional_medicine": 0,
+ "mmlu_professional_psychology": 0,
+ "mmlu_public_relations": 0,
+ "mmlu_security_studies": 0,
+ "mmlu_sociology": 0,
+ "mmlu_us_foreign_policy": 0,
+ "mmlu_virology": 0,
+ "mmlu_world_religions": 0
+ },
+ "higher_is_better": {
+ "mmlu": {
+ "acc": true
+ },
+ "mmlu_abstract_algebra": {
+ "acc": true
+ },
+ "mmlu_anatomy": {
+ "acc": true
+ },
+ "mmlu_astronomy": {
+ "acc": true
+ },
+ "mmlu_business_ethics": {
+ "acc": true
+ },
+ "mmlu_clinical_knowledge": {
+ "acc": true
+ },
+ "mmlu_college_biology": {
+ "acc": true
+ },
+ "mmlu_college_chemistry": {
+ "acc": true
+ },
+ "mmlu_college_computer_science": {
+ "acc": true
+ },
+ "mmlu_college_mathematics": {
+ "acc": true
+ },
+ "mmlu_college_medicine": {
+ "acc": true
+ },
+ "mmlu_college_physics": {
+ "acc": true
+ },
+ "mmlu_computer_security": {
+ "acc": true
+ },
+ "mmlu_conceptual_physics": {
+ "acc": true
+ },
+ "mmlu_econometrics": {
+ "acc": true
+ },
+ "mmlu_electrical_engineering": {
+ "acc": true
+ },
+ "mmlu_elementary_mathematics": {
+ "acc": true
+ },
+ "mmlu_formal_logic": {
+ "acc": true
+ },
+ "mmlu_global_facts": {
+ "acc": true
+ },
+ "mmlu_high_school_biology": {
+ "acc": true
+ },
+ "mmlu_high_school_chemistry": {
+ "acc": true
+ },
+ "mmlu_high_school_computer_science": {
+ "acc": true
+ },
+ "mmlu_high_school_european_history": {
+ "acc": true
+ },
+ "mmlu_high_school_geography": {
+ "acc": true
+ },
+ "mmlu_high_school_government_and_politics": {
+ "acc": true
+ },
+ "mmlu_high_school_macroeconomics": {
+ "acc": true
+ },
+ "mmlu_high_school_mathematics": {
+ "acc": true
+ },
+ "mmlu_high_school_microeconomics": {
+ "acc": true
+ },
+ "mmlu_high_school_physics": {
+ "acc": true
+ },
+ "mmlu_high_school_psychology": {
+ "acc": true
+ },
+ "mmlu_high_school_statistics": {
+ "acc": true
+ },
+ "mmlu_high_school_us_history": {
+ "acc": true
+ },
+ "mmlu_high_school_world_history": {
+ "acc": true
+ },
+ "mmlu_human_aging": {
+ "acc": true
+ },
+ "mmlu_human_sexuality": {
+ "acc": true
+ },
+ "mmlu_humanities": {
+ "acc": true
+ },
+ "mmlu_international_law": {
+ "acc": true
+ },
+ "mmlu_jurisprudence": {
+ "acc": true
+ },
+ "mmlu_logical_fallacies": {
+ "acc": true
+ },
+ "mmlu_machine_learning": {
+ "acc": true
+ },
+ "mmlu_management": {
+ "acc": true
+ },
+ "mmlu_marketing": {
+ "acc": true
+ },
+ "mmlu_medical_genetics": {
+ "acc": true
+ },
+ "mmlu_miscellaneous": {
+ "acc": true
+ },
+ "mmlu_moral_disputes": {
+ "acc": true
+ },
+ "mmlu_moral_scenarios": {
+ "acc": true
+ },
+ "mmlu_nutrition": {
+ "acc": true
+ },
+ "mmlu_other": {
+ "acc": true
+ },
+ "mmlu_philosophy": {
+ "acc": true
+ },
+ "mmlu_prehistory": {
+ "acc": true
+ },
+ "mmlu_professional_accounting": {
+ "acc": true
+ },
+ "mmlu_professional_law": {
+ "acc": true
+ },
+ "mmlu_professional_medicine": {
+ "acc": true
+ },
+ "mmlu_professional_psychology": {
+ "acc": true
+ },
+ "mmlu_public_relations": {
+ "acc": true
+ },
+ "mmlu_security_studies": {
+ "acc": true
+ },
+ "mmlu_social_sciences": {
+ "acc": true
+ },
+ "mmlu_sociology": {
+ "acc": true
+ },
+ "mmlu_stem": {
+ "acc": true
+ },
+ "mmlu_us_foreign_policy": {
+ "acc": true
+ },
+ "mmlu_virology": {
+ "acc": true
+ },
+ "mmlu_world_religions": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "mmlu_college_mathematics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_college_chemistry": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_college_physics": {
+ "original": 102,
+ "effective": 102
+ },
+ "mmlu_high_school_biology": {
+ "original": 310,
+ "effective": 310
+ },
+ "mmlu_astronomy": {
+ "original": 152,
+ "effective": 152
+ },
+ "mmlu_college_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_conceptual_physics": {
+ "original": 235,
+ "effective": 235
+ },
+ "mmlu_high_school_chemistry": {
+ "original": 203,
+ "effective": 203
+ },
+ "mmlu_high_school_statistics": {
+ "original": 216,
+ "effective": 216
+ },
+ "mmlu_electrical_engineering": {
+ "original": 145,
+ "effective": 145
+ },
+ "mmlu_abstract_algebra": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_high_school_mathematics": {
+ "original": 270,
+ "effective": 270
+ },
+ "mmlu_high_school_physics": {
+ "original": 151,
+ "effective": 151
+ },
+ "mmlu_high_school_computer_science": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_machine_learning": {
+ "original": 112,
+ "effective": 112
+ },
+ "mmlu_anatomy": {
+ "original": 135,
+ "effective": 135
+ },
+ "mmlu_elementary_mathematics": {
+ "original": 378,
+ "effective": 378
+ },
+ "mmlu_college_biology": {
+ "original": 144,
+ "effective": 144
+ },
+ "mmlu_computer_security": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_human_aging": {
+ "original": 223,
+ "effective": 223
+ },
+ "mmlu_miscellaneous": {
+ "original": 783,
+ "effective": 783
+ },
+ "mmlu_professional_medicine": {
+ "original": 272,
+ "effective": 272
+ },
+ "mmlu_college_medicine": {
+ "original": 173,
+ "effective": 173
+ },
+ "mmlu_clinical_knowledge": {
+ "original": 265,
+ "effective": 265
+ },
+ "mmlu_marketing": {
+ "original": 234,
+ "effective": 234
+ },
+ "mmlu_business_ethics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_global_facts": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_professional_accounting": {
+ "original": 282,
+ "effective": 282
+ },
+ "mmlu_virology": {
+ "original": 166,
+ "effective": 166
+ },
+ "mmlu_nutrition": {
+ "original": 306,
+ "effective": 306
+ },
+ "mmlu_management": {
+ "original": 103,
+ "effective": 103
+ },
+ "mmlu_medical_genetics": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_econometrics": {
+ "original": 114,
+ "effective": 114
+ },
+ "mmlu_public_relations": {
+ "original": 110,
+ "effective": 110
+ },
+ "mmlu_security_studies": {
+ "original": 245,
+ "effective": 245
+ },
+ "mmlu_professional_psychology": {
+ "original": 612,
+ "effective": 612
+ },
+ "mmlu_sociology": {
+ "original": 201,
+ "effective": 201
+ },
+ "mmlu_us_foreign_policy": {
+ "original": 100,
+ "effective": 100
+ },
+ "mmlu_human_sexuality": {
+ "original": 131,
+ "effective": 131
+ },
+ "mmlu_high_school_government_and_politics": {
+ "original": 193,
+ "effective": 193
+ },
+ "mmlu_high_school_macroeconomics": {
+ "original": 390,
+ "effective": 390
+ },
+ "mmlu_high_school_geography": {
+ "original": 198,
+ "effective": 198
+ },
+ "mmlu_high_school_psychology": {
+ "original": 545,
+ "effective": 545
+ },
+ "mmlu_high_school_microeconomics": {
+ "original": 238,
+ "effective": 238
+ },
+ "mmlu_moral_scenarios": {
+ "original": 895,
+ "effective": 895
+ },
+ "mmlu_formal_logic": {
+ "original": 126,
+ "effective": 126
+ },
+ "mmlu_high_school_european_history": {
+ "original": 165,
+ "effective": 165
+ },
+ "mmlu_high_school_world_history": {
+ "original": 237,
+ "effective": 237
+ },
+ "mmlu_high_school_us_history": {
+ "original": 204,
+ "effective": 204
+ },
+ "mmlu_international_law": {
+ "original": 121,
+ "effective": 121
+ },
+ "mmlu_professional_law": {
+ "original": 1534,
+ "effective": 1534
+ },
+ "mmlu_logical_fallacies": {
+ "original": 163,
+ "effective": 163
+ },
+ "mmlu_prehistory": {
+ "original": 324,
+ "effective": 324
+ },
+ "mmlu_moral_disputes": {
+ "original": 346,
+ "effective": 346
+ },
+ "mmlu_world_religions": {
+ "original": 171,
+ "effective": 171
+ },
+ "mmlu_philosophy": {
+ "original": 311,
+ "effective": 311
+ },
+ "mmlu_jurisprudence": {
+ "original": 108,
+ "effective": 108
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737585757.4256392,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 127778.472369656,
+ "end_time": 128825.949499582,
+ "total_evaluation_time_seconds": "1047.4771299260028"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/mmlu_pro_5_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/mmlu_pro_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..6127172d39a83ff7d4ebd825b269530ac780730c
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/mmlu_pro_5_shot.json
@@ -0,0 +1,1103 @@
+{
+ "results": {
+ "mmlu_pro": {
+ "exact_match,custom-extract": 0.6050531914893617,
+ "exact_match_stderr,custom-extract": 0.004324280084491081,
+ "alias": "mmlu_pro"
+ },
+ "mmlu_pro_biology": {
+ "alias": " - biology",
+ "exact_match,custom-extract": 0.797768479776848,
+ "exact_match_stderr,custom-extract": 0.01501088675930961
+ },
+ "mmlu_pro_business": {
+ "alias": " - business",
+ "exact_match,custom-extract": 0.6501901140684411,
+ "exact_match_stderr,custom-extract": 0.01698920714561709
+ },
+ "mmlu_pro_chemistry": {
+ "alias": " - chemistry",
+ "exact_match,custom-extract": 0.4628975265017668,
+ "exact_match_stderr,custom-extract": 0.014826536252330106
+ },
+ "mmlu_pro_computer_science": {
+ "alias": " - computer_science",
+ "exact_match,custom-extract": 0.6292682926829268,
+ "exact_match_stderr,custom-extract": 0.023882849188210376
+ },
+ "mmlu_pro_economics": {
+ "alias": " - economics",
+ "exact_match,custom-extract": 0.7571090047393365,
+ "exact_match_stderr,custom-extract": 0.01476968134954848
+ },
+ "mmlu_pro_engineering": {
+ "alias": " - engineering",
+ "exact_match,custom-extract": 0.4107327141382869,
+ "exact_match_stderr,custom-extract": 0.015812412469129674
+ },
+ "mmlu_pro_health": {
+ "alias": " - health",
+ "exact_match,custom-extract": 0.6894865525672371,
+ "exact_match_stderr,custom-extract": 0.01618795835147117
+ },
+ "mmlu_pro_history": {
+ "alias": " - history",
+ "exact_match,custom-extract": 0.6456692913385826,
+ "exact_match_stderr,custom-extract": 0.02453678535763431
+ },
+ "mmlu_pro_law": {
+ "alias": " - law",
+ "exact_match,custom-extract": 0.46684831970935514,
+ "exact_match_stderr,custom-extract": 0.01504239361072275
+ },
+ "mmlu_pro_math": {
+ "alias": " - math",
+ "exact_match,custom-extract": 0.5758697261287935,
+ "exact_match_stderr,custom-extract": 0.013450699683222997
+ },
+ "mmlu_pro_other": {
+ "alias": " - other",
+ "exact_match,custom-extract": 0.6829004329004329,
+ "exact_match_stderr,custom-extract": 0.015317068975451516
+ },
+ "mmlu_pro_philosophy": {
+ "alias": " - philosophy",
+ "exact_match,custom-extract": 0.6132264529058116,
+ "exact_match_stderr,custom-extract": 0.02182348732721747
+ },
+ "mmlu_pro_physics": {
+ "alias": " - physics",
+ "exact_match,custom-extract": 0.5481139337952271,
+ "exact_match_stderr,custom-extract": 0.013813780478397373
+ },
+ "mmlu_pro_psychology": {
+ "alias": " - psychology",
+ "exact_match,custom-extract": 0.7832080200501254,
+ "exact_match_stderr,custom-extract": 0.014595904333460285
+ }
+ },
+ "groups": {
+ "mmlu_pro": {
+ "exact_match,custom-extract": 0.6050531914893617,
+ "exact_match_stderr,custom-extract": 0.004324280084491081,
+ "alias": "mmlu_pro"
+ }
+ },
+ "group_subtasks": {
+ "mmlu_pro": [
+ "mmlu_pro_biology",
+ "mmlu_pro_business",
+ "mmlu_pro_chemistry",
+ "mmlu_pro_computer_science",
+ "mmlu_pro_economics",
+ "mmlu_pro_engineering",
+ "mmlu_pro_health",
+ "mmlu_pro_history",
+ "mmlu_pro_law",
+ "mmlu_pro_math",
+ "mmlu_pro_other",
+ "mmlu_pro_philosophy",
+ "mmlu_pro_physics",
+ "mmlu_pro_psychology"
+ ]
+ },
+ "configs": {
+ "mmlu_pro_biology": {
+ "task": "mmlu_pro_biology",
+ "task_alias": "biology",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='biology')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about biology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_business": {
+ "task": "mmlu_pro_business",
+ "task_alias": "business",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='business')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about business. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_chemistry": {
+ "task": "mmlu_pro_chemistry",
+ "task_alias": "chemistry",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='chemistry')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about chemistry. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_computer_science": {
+ "task": "mmlu_pro_computer_science",
+ "task_alias": "computer_science",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='computer science')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_economics": {
+ "task": "mmlu_pro_economics",
+ "task_alias": "economics",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='economics')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about economics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_engineering": {
+ "task": "mmlu_pro_engineering",
+ "task_alias": "engineering",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='engineering')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about engineering. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_health": {
+ "task": "mmlu_pro_health",
+ "task_alias": "health",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='health')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about health. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_history": {
+ "task": "mmlu_pro_history",
+ "task_alias": "history",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='history')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about history. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_law": {
+ "task": "mmlu_pro_law",
+ "task_alias": "law",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='law')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about law. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_math": {
+ "task": "mmlu_pro_math",
+ "task_alias": "math",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='math')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about math. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_other": {
+ "task": "mmlu_pro_other",
+ "task_alias": "other",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='other')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_philosophy": {
+ "task": "mmlu_pro_philosophy",
+ "task_alias": "philosophy",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='philosophy')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about philosophy. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_physics": {
+ "task": "mmlu_pro_physics",
+ "task_alias": "physics",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='physics')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "mmlu_pro_psychology": {
+ "task": "mmlu_pro_psychology",
+ "task_alias": "psychology",
+ "dataset_path": "TIGER-Lab/MMLU-Pro",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "functools.partial(, subject='psychology')",
+ "doc_to_text": "functools.partial(, including_answer=False)",
+ "doc_to_target": "answer",
+ "description": "The following are multiple choice questions (with answers) about psychology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "doc_to_text": "functools.partial(, including_answer=True)",
+ "doc_to_target": ""
+ },
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "",
+ "Q:",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "custom-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "answer is \\(?([ABCDEFGHIJ])\\)?"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "mmlu_pro": 2.0,
+ "mmlu_pro_biology": 1.0,
+ "mmlu_pro_business": 1.0,
+ "mmlu_pro_chemistry": 1.0,
+ "mmlu_pro_computer_science": 1.0,
+ "mmlu_pro_economics": 1.0,
+ "mmlu_pro_engineering": 1.0,
+ "mmlu_pro_health": 1.0,
+ "mmlu_pro_history": 1.0,
+ "mmlu_pro_law": 1.0,
+ "mmlu_pro_math": 1.0,
+ "mmlu_pro_other": 1.0,
+ "mmlu_pro_philosophy": 1.0,
+ "mmlu_pro_physics": 1.0,
+ "mmlu_pro_psychology": 1.0
+ },
+ "n-shot": {
+ "mmlu_pro_biology": 5,
+ "mmlu_pro_business": 5,
+ "mmlu_pro_chemistry": 5,
+ "mmlu_pro_computer_science": 5,
+ "mmlu_pro_economics": 5,
+ "mmlu_pro_engineering": 5,
+ "mmlu_pro_health": 5,
+ "mmlu_pro_history": 5,
+ "mmlu_pro_law": 5,
+ "mmlu_pro_math": 5,
+ "mmlu_pro_other": 5,
+ "mmlu_pro_philosophy": 5,
+ "mmlu_pro_physics": 5,
+ "mmlu_pro_psychology": 5
+ },
+ "higher_is_better": {
+ "mmlu_pro": {
+ "exact_match": true
+ },
+ "mmlu_pro_biology": {
+ "exact_match": true
+ },
+ "mmlu_pro_business": {
+ "exact_match": true
+ },
+ "mmlu_pro_chemistry": {
+ "exact_match": true
+ },
+ "mmlu_pro_computer_science": {
+ "exact_match": true
+ },
+ "mmlu_pro_economics": {
+ "exact_match": true
+ },
+ "mmlu_pro_engineering": {
+ "exact_match": true
+ },
+ "mmlu_pro_health": {
+ "exact_match": true
+ },
+ "mmlu_pro_history": {
+ "exact_match": true
+ },
+ "mmlu_pro_law": {
+ "exact_match": true
+ },
+ "mmlu_pro_math": {
+ "exact_match": true
+ },
+ "mmlu_pro_other": {
+ "exact_match": true
+ },
+ "mmlu_pro_philosophy": {
+ "exact_match": true
+ },
+ "mmlu_pro_physics": {
+ "exact_match": true
+ },
+ "mmlu_pro_psychology": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "mmlu_pro_biology": {
+ "original": 717,
+ "effective": 717
+ },
+ "mmlu_pro_business": {
+ "original": 789,
+ "effective": 789
+ },
+ "mmlu_pro_chemistry": {
+ "original": 1132,
+ "effective": 1132
+ },
+ "mmlu_pro_computer_science": {
+ "original": 410,
+ "effective": 410
+ },
+ "mmlu_pro_economics": {
+ "original": 844,
+ "effective": 844
+ },
+ "mmlu_pro_engineering": {
+ "original": 969,
+ "effective": 969
+ },
+ "mmlu_pro_health": {
+ "original": 818,
+ "effective": 818
+ },
+ "mmlu_pro_history": {
+ "original": 381,
+ "effective": 381
+ },
+ "mmlu_pro_law": {
+ "original": 1101,
+ "effective": 1101
+ },
+ "mmlu_pro_math": {
+ "original": 1351,
+ "effective": 1351
+ },
+ "mmlu_pro_other": {
+ "original": 924,
+ "effective": 924
+ },
+ "mmlu_pro_philosophy": {
+ "original": 499,
+ "effective": 499
+ },
+ "mmlu_pro_physics": {
+ "original": 1299,
+ "effective": 1299
+ },
+ "mmlu_pro_psychology": {
+ "original": 798,
+ "effective": 798
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.8,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737968180.8770437,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "mmlu_pro_biology": "78a27f3d4ea386dd0f7b5045f25bf654ba560ee9feac7b22eab763c73b4c37b9",
+ "mmlu_pro_business": "9d10f8702f23d8d5aa9546ebf453e9333a6998a272450bc468b8f74bca8a1824",
+ "mmlu_pro_chemistry": "0e3a8823fed7bd895e42f5053851f12b125f62edfcb36809e4c0aebec80f4506",
+ "mmlu_pro_computer_science": "26e8d9026807a7552684e4ddd1a373873449548e0f0ac8abeada18f32cc5f685",
+ "mmlu_pro_economics": "427580d476e69dc8f095f487f3081cbff1dbfdd3a05a4c13c024ae5bd6907262",
+ "mmlu_pro_engineering": "66bc34b22bf2c19eab04a753e65e8aea2e6834544b27516a6aa2769a9be0b9e5",
+ "mmlu_pro_health": "62edd914028ea5b83013192e458af0d22b843d25ce0ac6e280244d819615cdc4",
+ "mmlu_pro_history": "8295796e4901f2a6b42a2bd8b6e888f2e64ae24ce451f8ecef70db6351f3583d",
+ "mmlu_pro_law": "6969a0ecb6ac565ee29e658094231ddcf1016237aff3d903f5d219dd68a2e5dd",
+ "mmlu_pro_math": "eb48989afd83cb45e2dfd8c769fbe986927de9eb06ac775a7237e939150f20ec",
+ "mmlu_pro_other": "82e12fde3ce84ca4d478ce4623e9dd3877b8bd46c7fc1346c3d9e534df9cbba3",
+ "mmlu_pro_philosophy": "1cd86d5d342a6029560af9a2d51e397df4f537d81d4e6249a0917267c91073e1",
+ "mmlu_pro_physics": "dce786711af6f503b9b1463ca9e245de515859363f4ee7f0aa94656c3357a288",
+ "mmlu_pro_psychology": "526f25dba79a26df39f911b7d6010990c8e21d7c473c89a94e4298566d7cdeda"
+ },
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 69200.147843926,
+ "end_time": 72294.189406545,
+ "total_evaluation_time_seconds": "3094.041562619008"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/triviaqa_5_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/triviaqa_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..83fe8d48756e894ceb763d64dfdc7b7eddd8ac5b
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/triviaqa_5_shot.json
@@ -0,0 +1,128 @@
+{
+ "results": {
+ "triviaqa": {
+ "alias": "triviaqa",
+ "exact_match,remove_whitespace": 0.817041908158716,
+ "exact_match_stderr,remove_whitespace": 0.0028863596794662027
+ }
+ },
+ "group_subtasks": {
+ "triviaqa": []
+ },
+ "configs": {
+ "triviaqa": {
+ "task": "triviaqa",
+ "dataset_path": "trivia_qa",
+ "dataset_name": "rc.nocontext",
+ "training_split": "train",
+ "validation_split": "validation",
+ "doc_to_text": "Question: {{question}}?\nAnswer:",
+ "doc_to_target": "{{answer.aliases}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "\n",
+ ".",
+ ","
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "remove_whitespace",
+ "filter": [
+ {
+ "function": "remove_whitespace"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "question",
+ "metadata": {
+ "version": 3.0
+ }
+ }
+ },
+ "versions": {
+ "triviaqa": 3.0
+ },
+ "n-shot": {
+ "triviaqa": 5
+ },
+ "higher_is_better": {
+ "triviaqa": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "triviaqa": {
+ "original": 17944,
+ "effective": 17944
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737582778.909245,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 124799.725543077,
+ "end_time": 125319.396698907,
+ "total_evaluation_time_seconds": "519.6711558300012"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/truthfulqa_mc2_0_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/truthfulqa_mc2_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..848ef784c3dace84813de225716e2c110816daaf
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/truthfulqa_mc2_0_shot.json
@@ -0,0 +1,108 @@
+{
+ "results": {
+ "truthfulqa_mc2": {
+ "alias": "truthfulqa_mc2",
+ "acc,none": 0.6090721533173807,
+ "acc_stderr,none": 0.014847067973697343
+ }
+ },
+ "group_subtasks": {
+ "truthfulqa_mc2": []
+ },
+ "configs": {
+ "truthfulqa_mc2": {
+ "task": "truthfulqa_mc2",
+ "tag": [
+ "truthfulqa"
+ ],
+ "dataset_path": "truthful_qa",
+ "dataset_name": "multiple_choice",
+ "validation_split": "validation",
+ "doc_to_text": "{% set prompt_qa = 'Q: What is human life expectancy in the United States?\nA: Human life expectancy in the United States is 78 years.\n\nQ: Who was president of the United States in 1955?\nA: Dwight D. Eisenhower was president of the United States in 1955.\n\nQ: Which party did he belong to?\nA: He belonged to the Republican Party.\n\nQ: What is the square root of banana?\nA: I have no comment.\n\nQ: How does a telescope work?\nA: Telescopes use lenses or mirrors to focus light and make objects appear closer.\n\nQ: Where were the 1992 Olympics held?\nA: The 1992 Olympics were held in Barcelona, Spain.'%}{{prompt_qa + '\n\nQ: ' + question + '\nA:'}}",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{mc2_targets.choices}}",
+ "process_results": "def process_results_mc2(doc, results):\n lls, is_greedy = zip(*results)\n\n # Split on the first `0` as everything before it is true (`1`).\n split_idx = list(doc[\"mc2_targets\"][\"labels\"]).index(0)\n # Compute the normalized probability mass for the correct answer.\n ll_true, ll_false = lls[:split_idx], lls[split_idx:]\n p_true, p_false = np.exp(np.array(ll_true)), np.exp(np.array(ll_false))\n p_true = p_true / (sum(p_true) + sum(p_false))\n\n return {\"acc\": sum(p_true)}\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "truthfulqa_mc2": 2.0
+ },
+ "n-shot": {
+ "truthfulqa_mc2": 0
+ },
+ "higher_is_better": {
+ "truthfulqa_mc2": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "truthfulqa_mc2": {
+ "original": 817,
+ "effective": 817
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737581194.728857,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 123215.544564302,
+ "end_time": 123421.64257545,
+ "total_evaluation_time_seconds": "206.09801114798756"
+}
\ No newline at end of file
diff --git a/evaluations/en/Llama-3.3-70B-Instruct/winogrande_0_shot.json b/evaluations/en/Llama-3.3-70B-Instruct/winogrande_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..8bd2e7dba1c4a50f3a55408a7e7d1d7d12ac61a2
--- /dev/null
+++ b/evaluations/en/Llama-3.3-70B-Instruct/winogrande_0_shot.json
@@ -0,0 +1,108 @@
+{
+ "results": {
+ "winogrande": {
+ "alias": "winogrande",
+ "acc,none": 0.7924230465666929,
+ "acc_stderr,none": 0.011398593419386783
+ }
+ },
+ "group_subtasks": {
+ "winogrande": []
+ },
+ "configs": {
+ "winogrande": {
+ "task": "winogrande",
+ "dataset_path": "winogrande",
+ "dataset_name": "winogrande_xl",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "validation_split": "validation",
+ "doc_to_text": "def doc_to_text(doc):\n answer_to_num = {\"1\": 0, \"2\": 1}\n return answer_to_num[doc[\"answer\"]]\n",
+ "doc_to_target": "def doc_to_target(doc):\n idx = doc[\"sentence\"].index(\"_\") + 1\n return doc[\"sentence\"][idx:].strip()\n",
+ "doc_to_choice": "def doc_to_choice(doc):\n idx = doc[\"sentence\"].index(\"_\")\n options = [doc[\"option1\"], doc[\"option2\"]]\n return [doc[\"sentence\"][:idx] + opt for opt in options]\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "sentence",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "winogrande": 1.0
+ },
+ "n-shot": {
+ "winogrande": 0
+ },
+ "higher_is_better": {
+ "winogrande": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "winogrande": {
+ "original": 1267,
+ "effective": 1267
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.9,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "150ae04f",
+ "date": 1737581074.38925,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|finetune_right_pad_id|>",
+ "128004"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 123095.348423816,
+ "end_time": 123177.388886054,
+ "total_evaluation_time_seconds": "82.04046223800106"
+}
\ No newline at end of file
diff --git a/evaluations/en/Meta-Llama-3.1-8B-Instruct/agieval_0_shot.json b/evaluations/en/Meta-Llama-3.1-8B-Instruct/agieval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..e49cc4733c6cbebff1f301d62606cf592ff8ed25
--- /dev/null
+++ b/evaluations/en/Meta-Llama-3.1-8B-Instruct/agieval_0_shot.json
@@ -0,0 +1,1130 @@
+{
+ "results": {
+ "agieval": {
+ "acc,none": 0.42392356071601356,
+ "acc_stderr,none": 0.004999593208027632,
+ "alias": "agieval"
+ },
+ "agieval_aqua_rat": {
+ "alias": " - agieval_aqua_rat",
+ "acc,none": 0.2952755905511811,
+ "acc_stderr,none": 0.02867894492686086,
+ "acc_norm,none": 0.25196850393700787,
+ "acc_norm_stderr,none": 0.027294353392553598
+ },
+ "agieval_gaokao_biology": {
+ "alias": " - agieval_gaokao_biology",
+ "acc,none": 0.49047619047619045,
+ "acc_stderr,none": 0.034579448570031264,
+ "acc_norm,none": 0.48095238095238096,
+ "acc_norm_stderr,none": 0.034560617865111484
+ },
+ "agieval_gaokao_chemistry": {
+ "alias": " - agieval_gaokao_chemistry",
+ "acc,none": 0.38164251207729466,
+ "acc_stderr,none": 0.03384656305081144,
+ "acc_norm,none": 0.30434782608695654,
+ "acc_norm_stderr,none": 0.032058822365635266
+ },
+ "agieval_gaokao_chinese": {
+ "alias": " - agieval_gaokao_chinese",
+ "acc,none": 0.4349593495934959,
+ "acc_stderr,none": 0.031672412111456834,
+ "acc_norm,none": 0.43089430894308944,
+ "acc_norm_stderr,none": 0.03163725545151277
+ },
+ "agieval_gaokao_english": {
+ "alias": " - agieval_gaokao_english",
+ "acc,none": 0.7516339869281046,
+ "acc_stderr,none": 0.02473998135511359,
+ "acc_norm,none": 0.7450980392156863,
+ "acc_norm_stderr,none": 0.024954184324879912
+ },
+ "agieval_gaokao_geography": {
+ "alias": " - agieval_gaokao_geography",
+ "acc,none": 0.592964824120603,
+ "acc_stderr,none": 0.03491385802519053,
+ "acc_norm,none": 0.5678391959798995,
+ "acc_norm_stderr,none": 0.035204872502584535
+ },
+ "agieval_gaokao_history": {
+ "alias": " - agieval_gaokao_history",
+ "acc,none": 0.5361702127659574,
+ "acc_stderr,none": 0.03260038511835771,
+ "acc_norm,none": 0.4808510638297872,
+ "acc_norm_stderr,none": 0.032662042990646796
+ },
+ "agieval_gaokao_mathcloze": {
+ "alias": " - agieval_gaokao_mathcloze",
+ "acc,none": 0.03389830508474576,
+ "acc_stderr,none": 0.016730444637044904
+ },
+ "agieval_gaokao_mathqa": {
+ "alias": " - agieval_gaokao_mathqa",
+ "acc,none": 0.301994301994302,
+ "acc_stderr,none": 0.02454114583174699,
+ "acc_norm,none": 0.2934472934472934,
+ "acc_norm_stderr,none": 0.024339032696810918
+ },
+ "agieval_gaokao_physics": {
+ "alias": " - agieval_gaokao_physics",
+ "acc,none": 0.435,
+ "acc_stderr,none": 0.03514328173714407,
+ "acc_norm,none": 0.435,
+ "acc_norm_stderr,none": 0.03514328173714407
+ },
+ "agieval_jec_qa_ca": {
+ "alias": " - agieval_jec_qa_ca",
+ "acc,none": 0.5445445445445446,
+ "acc_stderr,none": 0.015764289047389874,
+ "acc_norm,none": 0.4974974974974975,
+ "acc_norm_stderr,none": 0.015827025208013587
+ },
+ "agieval_jec_qa_kd": {
+ "alias": " - agieval_jec_qa_kd",
+ "acc,none": 0.622,
+ "acc_stderr,none": 0.015341165254026649,
+ "acc_norm,none": 0.56,
+ "acc_norm_stderr,none": 0.015704987954361784
+ },
+ "agieval_logiqa_en": {
+ "alias": " - agieval_logiqa_en",
+ "acc,none": 0.3686635944700461,
+ "acc_stderr,none": 0.018922951005122538,
+ "acc_norm,none": 0.3824884792626728,
+ "acc_norm_stderr,none": 0.019062288283575927
+ },
+ "agieval_logiqa_zh": {
+ "alias": " - agieval_logiqa_zh",
+ "acc,none": 0.35176651305683565,
+ "acc_stderr,none": 0.01872993627442735,
+ "acc_norm,none": 0.3824884792626728,
+ "acc_norm_stderr,none": 0.019062288283575913
+ },
+ "agieval_lsat_ar": {
+ "alias": " - agieval_lsat_ar",
+ "acc,none": 0.19130434782608696,
+ "acc_stderr,none": 0.025991852462828483,
+ "acc_norm,none": 0.20434782608695654,
+ "acc_norm_stderr,none": 0.026645808150011344
+ },
+ "agieval_lsat_lr": {
+ "alias": " - agieval_lsat_lr",
+ "acc,none": 0.45294117647058824,
+ "acc_stderr,none": 0.02206373457408461,
+ "acc_norm,none": 0.4235294117647059,
+ "acc_norm_stderr,none": 0.021901379648792144
+ },
+ "agieval_lsat_rc": {
+ "alias": " - agieval_lsat_rc",
+ "acc,none": 0.6356877323420075,
+ "acc_stderr,none": 0.02939621506324139,
+ "acc_norm,none": 0.5650557620817844,
+ "acc_norm_stderr,none": 0.030282731632881126
+ },
+ "agieval_math": {
+ "alias": " - agieval_math",
+ "acc,none": 0.056,
+ "acc_stderr,none": 0.007274401481697056
+ },
+ "agieval_sat_en": {
+ "alias": " - agieval_sat_en",
+ "acc,none": 0.7718446601941747,
+ "acc_stderr,none": 0.02930915787324171,
+ "acc_norm,none": 0.7135922330097088,
+ "acc_norm_stderr,none": 0.031574793744217594
+ },
+ "agieval_sat_en_without_passage": {
+ "alias": " - agieval_sat_en_without_passage",
+ "acc,none": 0.39805825242718446,
+ "acc_stderr,none": 0.03418799390613399,
+ "acc_norm,none": 0.34951456310679613,
+ "acc_norm_stderr,none": 0.03330232052876046
+ },
+ "agieval_sat_math": {
+ "alias": " - agieval_sat_math",
+ "acc,none": 0.41818181818181815,
+ "acc_stderr,none": 0.03333144641627121,
+ "acc_norm,none": 0.33636363636363636,
+ "acc_norm_stderr,none": 0.03192622349349311
+ }
+ },
+ "groups": {
+ "agieval": {
+ "acc,none": 0.42392356071601356,
+ "acc_stderr,none": 0.004999593208027632,
+ "alias": "agieval"
+ }
+ },
+ "group_subtasks": {
+ "agieval": [
+ "agieval_gaokao_biology",
+ "agieval_gaokao_chemistry",
+ "agieval_gaokao_chinese",
+ "agieval_gaokao_geography",
+ "agieval_gaokao_history",
+ "agieval_gaokao_mathcloze",
+ "agieval_gaokao_mathqa",
+ "agieval_gaokao_physics",
+ "agieval_jec_qa_ca",
+ "agieval_jec_qa_kd",
+ "agieval_logiqa_zh",
+ "agieval_aqua_rat",
+ "agieval_gaokao_english",
+ "agieval_logiqa_en",
+ "agieval_lsat_ar",
+ "agieval_lsat_lr",
+ "agieval_lsat_rc",
+ "agieval_math",
+ "agieval_sat_en_without_passage",
+ "agieval_sat_en",
+ "agieval_sat_math"
+ ]
+ },
+ "configs": {
+ "agieval_aqua_rat": {
+ "task": "agieval_aqua_rat",
+ "dataset_path": "hails/agieval-aqua-rat",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_biology": {
+ "task": "agieval_gaokao_biology",
+ "dataset_path": "hails/agieval-gaokao-biology",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chemistry": {
+ "task": "agieval_gaokao_chemistry",
+ "dataset_path": "hails/agieval-gaokao-chemistry",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_chinese": {
+ "task": "agieval_gaokao_chinese",
+ "dataset_path": "hails/agieval-gaokao-chinese",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_english": {
+ "task": "agieval_gaokao_english",
+ "dataset_path": "hails/agieval-gaokao-english",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_geography": {
+ "task": "agieval_gaokao_geography",
+ "dataset_path": "hails/agieval-gaokao-geography",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_history": {
+ "task": "agieval_gaokao_history",
+ "dataset_path": "hails/agieval-gaokao-history",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathcloze": {
+ "task": "agieval_gaokao_mathcloze",
+ "dataset_path": "hails/agieval-gaokao-mathcloze",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_mathqa": {
+ "task": "agieval_gaokao_mathqa",
+ "dataset_path": "hails/agieval-gaokao-mathqa",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_gaokao_physics": {
+ "task": "agieval_gaokao_physics",
+ "dataset_path": "hails/agieval-gaokao-physics",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_ca": {
+ "task": "agieval_jec_qa_ca",
+ "dataset_path": "hails/agieval-jec-qa-ca",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_jec_qa_kd": {
+ "task": "agieval_jec_qa_kd",
+ "dataset_path": "hails/agieval-jec-qa-kd",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_en": {
+ "task": "agieval_logiqa_en",
+ "dataset_path": "hails/agieval-logiqa-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_logiqa_zh": {
+ "task": "agieval_logiqa_zh",
+ "dataset_path": "hails/agieval-logiqa-zh",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_ar": {
+ "task": "agieval_lsat_ar",
+ "dataset_path": "hails/agieval-lsat-ar",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_lr": {
+ "task": "agieval_lsat_lr",
+ "dataset_path": "hails/agieval-lsat-lr",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_lsat_rc": {
+ "task": "agieval_lsat_rc",
+ "dataset_path": "hails/agieval-lsat-rc",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_math": {
+ "task": "agieval_math",
+ "dataset_path": "hails/agieval-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{answer}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidate = results[0]\n\n gold = doc[\"answer\"]\n\n if not gold:\n print(doc, candidate, gold)\n if is_equiv(candidate, gold):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"acc\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "max_gen_toks": 32,
+ "do_sample": false,
+ "temperature": 0.0,
+ "until": [
+ "Q:"
+ ]
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en": {
+ "task": "agieval_sat_en",
+ "dataset_path": "hails/agieval-sat-en",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_en_without_passage": {
+ "task": "agieval_sat_en_without_passage",
+ "dataset_path": "hails/agieval-sat-en-without-passage",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "agieval_sat_math": {
+ "task": "agieval_sat_math",
+ "dataset_path": "hails/agieval-sat-math",
+ "test_split": "test",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{gold}}",
+ "doc_to_choice": "{{choices}}",
+ "process_results": "def process_results_mcqa(doc, results):\n results = [result[0] for result in results]\n\n gold = doc[\"gold\"]\n\n acc = 1.0 if int(np.argmax(results)) in gold else 0.0\n completion_len = np.array([float(len(i)) for i in doc[\"choices\"]])\n acc_norm = 1.0 if int(np.argmax(results / completion_len)) in gold else 0.0\n\n return {\n \"acc\": acc,\n \"acc_norm\": acc_norm,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "agieval": 0.0,
+ "agieval_aqua_rat": 1.0,
+ "agieval_gaokao_biology": 1.0,
+ "agieval_gaokao_chemistry": 1.0,
+ "agieval_gaokao_chinese": 1.0,
+ "agieval_gaokao_english": 1.0,
+ "agieval_gaokao_geography": 1.0,
+ "agieval_gaokao_history": 1.0,
+ "agieval_gaokao_mathcloze": 1.0,
+ "agieval_gaokao_mathqa": 1.0,
+ "agieval_gaokao_physics": 1.0,
+ "agieval_jec_qa_ca": 1.0,
+ "agieval_jec_qa_kd": 1.0,
+ "agieval_logiqa_en": 1.0,
+ "agieval_logiqa_zh": 1.0,
+ "agieval_lsat_ar": 1.0,
+ "agieval_lsat_lr": 1.0,
+ "agieval_lsat_rc": 1.0,
+ "agieval_math": 1.0,
+ "agieval_sat_en": 1.0,
+ "agieval_sat_en_without_passage": 1.0,
+ "agieval_sat_math": 1.0
+ },
+ "n-shot": {
+ "agieval_aqua_rat": 0,
+ "agieval_gaokao_biology": 0,
+ "agieval_gaokao_chemistry": 0,
+ "agieval_gaokao_chinese": 0,
+ "agieval_gaokao_english": 0,
+ "agieval_gaokao_geography": 0,
+ "agieval_gaokao_history": 0,
+ "agieval_gaokao_mathcloze": 0,
+ "agieval_gaokao_mathqa": 0,
+ "agieval_gaokao_physics": 0,
+ "agieval_jec_qa_ca": 0,
+ "agieval_jec_qa_kd": 0,
+ "agieval_logiqa_en": 0,
+ "agieval_logiqa_zh": 0,
+ "agieval_lsat_ar": 0,
+ "agieval_lsat_lr": 0,
+ "agieval_lsat_rc": 0,
+ "agieval_math": 0,
+ "agieval_sat_en": 0,
+ "agieval_sat_en_without_passage": 0,
+ "agieval_sat_math": 0
+ },
+ "higher_is_better": {
+ "agieval": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_aqua_rat": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_biology": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chemistry": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_chinese": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_english": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_geography": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_history": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_mathcloze": {
+ "acc": true
+ },
+ "agieval_gaokao_mathqa": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_gaokao_physics": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_ca": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_jec_qa_kd": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_logiqa_zh": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_ar": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_lr": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_lsat_rc": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_math": {
+ "acc": true
+ },
+ "agieval_sat_en": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_en_without_passage": {
+ "acc": true,
+ "acc_norm": true
+ },
+ "agieval_sat_math": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "agieval_gaokao_biology": {
+ "original": 210,
+ "effective": 210
+ },
+ "agieval_gaokao_chemistry": {
+ "original": 207,
+ "effective": 207
+ },
+ "agieval_gaokao_chinese": {
+ "original": 246,
+ "effective": 246
+ },
+ "agieval_gaokao_geography": {
+ "original": 199,
+ "effective": 199
+ },
+ "agieval_gaokao_history": {
+ "original": 235,
+ "effective": 235
+ },
+ "agieval_gaokao_mathcloze": {
+ "original": 118,
+ "effective": 118
+ },
+ "agieval_gaokao_mathqa": {
+ "original": 351,
+ "effective": 351
+ },
+ "agieval_gaokao_physics": {
+ "original": 200,
+ "effective": 200
+ },
+ "agieval_jec_qa_ca": {
+ "original": 999,
+ "effective": 999
+ },
+ "agieval_jec_qa_kd": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_logiqa_zh": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_aqua_rat": {
+ "original": 254,
+ "effective": 254
+ },
+ "agieval_gaokao_english": {
+ "original": 306,
+ "effective": 306
+ },
+ "agieval_logiqa_en": {
+ "original": 651,
+ "effective": 651
+ },
+ "agieval_lsat_ar": {
+ "original": 230,
+ "effective": 230
+ },
+ "agieval_lsat_lr": {
+ "original": 510,
+ "effective": 510
+ },
+ "agieval_lsat_rc": {
+ "original": 269,
+ "effective": 269
+ },
+ "agieval_math": {
+ "original": 1000,
+ "effective": 1000
+ },
+ "agieval_sat_en_without_passage": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_en": {
+ "original": 206,
+ "effective": 206
+ },
+ "agieval_sat_math": {
+ "original": 220,
+ "effective": 220
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,data_parallel_size=8,gpu_memory_utilization=0.8,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737961150.0996048,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "agieval_gaokao_biology": "48856850a9c3cb2bdd072c002e182cf4dc1270c513df1b196c07cd50c35ee312",
+ "agieval_gaokao_chemistry": "298b30fddb559f13b752f13e9d5df9870ed193e55d393fa75daabc989f6d14a2",
+ "agieval_gaokao_chinese": "dbde0aa44b028bf2ae28c3e3bd3eb4b5c76a1c9e335b93377719aeae0f385089",
+ "agieval_gaokao_geography": "0f6315ed900034917ccc6a2a7e8af396ac5450984f5d2995966f4e6d944ddca7",
+ "agieval_gaokao_history": "477fc7b6346abd5e6d7899fbdf17f9b6480fcee718412afe23efcf7d2b467c99",
+ "agieval_gaokao_mathcloze": "e7d869494f25d82eb72aae9a978c044d2dd05456eb59288f5396caa2e976c37c",
+ "agieval_gaokao_mathqa": "a990d2387b02674e639121eeaf4bf747d0b7950638c0cf305818e1e7307271cd",
+ "agieval_gaokao_physics": "b35f0e58df73200a0b4bd485904fa2f31ddcbdb906d62166a21715a9fec13df6",
+ "agieval_jec_qa_ca": "8ece590313c402549921441fee0b161996f57a073d2562f41dcab194adf3d6e1",
+ "agieval_jec_qa_kd": "f968b31c5a4a5b2e2a309162cc1966ce2d859ae3db467b9bf77aec1dcf3da313",
+ "agieval_logiqa_zh": "e7dfec6cca6c9d836bcf0090fa307a59af484030c0395793b9ef4890dd73dae7",
+ "agieval_aqua_rat": "2186c15644e0585992df4e6090e4cbdc623f814a4725803c9fe053a3c6eee826",
+ "agieval_gaokao_english": "1997a0d2b769dd5690676a55acba44f9655257b3ec335745d4f8b70045941028",
+ "agieval_logiqa_en": "8cbc44ae4163ae2093f88be6eb95327bd0ac1c1aef48c40549bf0769b43aa0de",
+ "agieval_lsat_ar": "d09b7b14ebb5f21bbd602143c8fc62a4edef6a64ab0f6eb87b9aafa7a4426c43",
+ "agieval_lsat_lr": "a5cd32cd2a2759d428ef21fd2e8362276fe0b15dc1fff48fe30f6f39525d1336",
+ "agieval_lsat_rc": "ce4856d4b9eaa4beb1ab1cb0e139f73d4097298e16e06025258b05b3d422b0eb",
+ "agieval_math": "c4edf8986242f57ad6d5c1cb001b194b30d20a60bd6fb0909cb37b5e0d6d5c56",
+ "agieval_sat_en_without_passage": "11bfc5e60248d5acab69f12abac189f630e0b3ad7dc8cdb9db8ccdc040516bb0",
+ "agieval_sat_en": "3bb865c97a1fcec9154b1dbbae2bac428982fb809d8d42bb1ddb83199881c7ac",
+ "agieval_sat_math": "63798581920be3a992f61dab8df71eb75cb455163fca9ea156540d204951c2c2"
+ },
+ "model_source": "vllm",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 990290.065929208,
+ "end_time": 990703.867264399,
+ "total_evaluation_time_seconds": "413.8013351910049"
+}
\ No newline at end of file
diff --git a/evaluations/en/Meta-Llama-3.1-8B-Instruct/arc_challenge_0_shot.json b/evaluations/en/Meta-Llama-3.1-8B-Instruct/arc_challenge_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..dd29a74e8684b2593f28acb3f2992c662b1ef642
--- /dev/null
+++ b/evaluations/en/Meta-Llama-3.1-8B-Instruct/arc_challenge_0_shot.json
@@ -0,0 +1,119 @@
+{
+ "results": {
+ "arc_challenge": {
+ "alias": "arc_challenge",
+ "acc,none": 0.5170648464163823,
+ "acc_stderr,none": 0.014602878388536598,
+ "acc_norm,none": 0.5511945392491467,
+ "acc_norm_stderr,none": 0.014534599585097667
+ }
+ },
+ "group_subtasks": {
+ "arc_challenge": []
+ },
+ "configs": {
+ "arc_challenge": {
+ "task": "arc_challenge",
+ "tag": [
+ "ai2_arc"
+ ],
+ "dataset_path": "allenai/ai2_arc",
+ "dataset_name": "ARC-Challenge",
+ "training_split": "train",
+ "validation_split": "validation",
+ "test_split": "test",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{choices.label.index(answerKey)}}",
+ "doc_to_choice": "{{choices.text}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question: {{question}}\nAnswer:",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "arc_challenge": 1.0
+ },
+ "n-shot": {
+ "arc_challenge": 0
+ },
+ "higher_is_better": {
+ "arc_challenge": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "arc_challenge": {
+ "original": 1172,
+ "effective": 1172
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,data_parallel_size=8,gpu_memory_utilization=0.8,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737961621.350289,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "arc_challenge": "09f9ae87a0905d63512cffc4aa91a55e44258fc35160e40fa1eb66fb75473e34"
+ },
+ "model_source": "vllm",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 990761.352605304,
+ "end_time": 990811.547884618,
+ "total_evaluation_time_seconds": "50.19527931406628"
+}
\ No newline at end of file
diff --git a/evaluations/en/Meta-Llama-3.1-8B-Instruct/gpqa_main_n_shot_0_shot.json b/evaluations/en/Meta-Llama-3.1-8B-Instruct/gpqa_main_n_shot_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..7bf0f0fbebebc5b0a306a60ab01809db36e5f934
--- /dev/null
+++ b/evaluations/en/Meta-Llama-3.1-8B-Instruct/gpqa_main_n_shot_0_shot.json
@@ -0,0 +1,121 @@
+{
+ "results": {
+ "gpqa_main_n_shot": {
+ "alias": "gpqa_main_n_shot",
+ "acc,none": 0.27232142857142855,
+ "acc_stderr,none": 0.021055082129324165,
+ "acc_norm,none": 0.27232142857142855,
+ "acc_norm_stderr,none": 0.021055082129324165
+ }
+ },
+ "group_subtasks": {
+ "gpqa_main_n_shot": []
+ },
+ "configs": {
+ "gpqa_main_n_shot": {
+ "task": "gpqa_main_n_shot",
+ "tag": "gpqa",
+ "dataset_path": "Idavidrein/gpqa",
+ "dataset_name": "gpqa_main",
+ "training_split": "train",
+ "validation_split": "train",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n choices = [\n preprocess(doc[\"Incorrect Answer 1\"]),\n preprocess(doc[\"Incorrect Answer 2\"]),\n preprocess(doc[\"Incorrect Answer 3\"]),\n preprocess(doc[\"Correct Answer\"]),\n ]\n\n rng.shuffle(choices)\n correct_answer_index = choices.index(preprocess(doc[\"Correct Answer\"]))\n\n out_doc = {\n \"choice1\": choices[0],\n \"choice2\": choices[1],\n \"choice3\": choices[2],\n \"choice4\": choices[3],\n \"answer\": f\"({chr(65 + correct_answer_index)})\",\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer:",
+ "doc_to_target": "answer",
+ "doc_to_choice": [
+ "(A)",
+ "(B)",
+ "(C)",
+ "(D)"
+ ],
+ "description": "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "gpqa_main_n_shot": 2.0
+ },
+ "n-shot": {
+ "gpqa_main_n_shot": 0
+ },
+ "higher_is_better": {
+ "gpqa_main_n_shot": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "gpqa_main_n_shot": {
+ "original": 448,
+ "effective": 448
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,data_parallel_size=8,gpu_memory_utilization=0.8,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737961727.1741447,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "gpqa_main_n_shot": "4a64f5415ed03d5c5fec2b22dd8bfd718011928a30847c5b126c837aaf0c0619"
+ },
+ "model_source": "vllm",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 990867.19129279,
+ "end_time": 990922.774824139,
+ "total_evaluation_time_seconds": "55.58353134896606"
+}
\ No newline at end of file
diff --git a/evaluations/en/Meta-Llama-3.1-8B-Instruct/gsm8k_5_shot.json b/evaluations/en/Meta-Llama-3.1-8B-Instruct/gsm8k_5_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..da72532da9c76f550a2f182e10e6ad43f7bf8579
--- /dev/null
+++ b/evaluations/en/Meta-Llama-3.1-8B-Instruct/gsm8k_5_shot.json
@@ -0,0 +1,155 @@
+{
+ "results": {
+ "gsm8k": {
+ "alias": "gsm8k",
+ "exact_match,strict-match": 0.7649734647460197,
+ "exact_match_stderr,strict-match": 0.011679491349994874,
+ "exact_match,flexible-extract": 0.7869598180439727,
+ "exact_match_stderr,flexible-extract": 0.011278447856900771
+ }
+ },
+ "group_subtasks": {
+ "gsm8k": []
+ },
+ "configs": {
+ "gsm8k": {
+ "task": "gsm8k",
+ "tag": [
+ "math_word_problems"
+ ],
+ "dataset_path": "gsm8k",
+ "dataset_name": "main",
+ "training_split": "train",
+ "test_split": "test",
+ "fewshot_split": "train",
+ "doc_to_text": "Question: {{question}}\nAnswer:",
+ "doc_to_target": "{{answer}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true,
+ "ignore_case": true,
+ "ignore_punctuation": false,
+ "regexes_to_ignore": [
+ ",",
+ "\\$",
+ "(?s).*#### ",
+ "\\.$"
+ ]
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Question:",
+ "",
+ "<|im_end|>"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "filter_list": [
+ {
+ "name": "strict-match",
+ "filter": [
+ {
+ "function": "regex",
+ "regex_pattern": "#### (\\-?[0-9\\.\\,]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ },
+ {
+ "name": "flexible-extract",
+ "filter": [
+ {
+ "function": "regex",
+ "group_select": -1,
+ "regex_pattern": "(-?[$0-9.,]{2,})|(-?[0-9]+)"
+ },
+ {
+ "function": "take_first"
+ }
+ ]
+ }
+ ],
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 3.0
+ }
+ }
+ },
+ "versions": {
+ "gsm8k": 3.0
+ },
+ "n-shot": {
+ "gsm8k": 5
+ },
+ "higher_is_better": {
+ "gsm8k": {
+ "exact_match": true
+ }
+ },
+ "n-samples": {
+ "gsm8k": {
+ "original": 1319,
+ "effective": 1319
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,data_parallel_size=8,gpu_memory_utilization=0.8,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737961837.484743,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "gsm8k": "2330f4ebfcccaf66a892922df2819cdb1f118e448d076d3f42bdde4177678ac7"
+ },
+ "model_source": "vllm",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 990977.464841778,
+ "end_time": 991047.570395286,
+ "total_evaluation_time_seconds": "70.10555350792129"
+}
\ No newline at end of file
diff --git a/evaluations/en/Meta-Llama-3.1-8B-Instruct/hellaswag_0_shot.json b/evaluations/en/Meta-Llama-3.1-8B-Instruct/hellaswag_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..250465b8df597f7c66bf5a0aaea27b03d477c825
--- /dev/null
+++ b/evaluations/en/Meta-Llama-3.1-8B-Instruct/hellaswag_0_shot.json
@@ -0,0 +1,120 @@
+{
+ "results": {
+ "hellaswag": {
+ "alias": "hellaswag",
+ "acc,none": 0.5909181437960566,
+ "acc_stderr,none": 0.004906595857916792,
+ "acc_norm,none": 0.7927703644692292,
+ "acc_norm_stderr,none": 0.004044931315182791
+ }
+ },
+ "group_subtasks": {
+ "hellaswag": []
+ },
+ "configs": {
+ "hellaswag": {
+ "task": "hellaswag",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "hellaswag",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "validation_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc):\n ctx = doc[\"ctx_a\"] + \" \" + doc[\"ctx_b\"].capitalize()\n out_doc = {\n \"query\": preprocess(doc[\"activity_label\"] + \": \" + ctx),\n \"choices\": [preprocess(ending) for ending in doc[\"endings\"]],\n \"gold\": int(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "{{query}}",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": "choices",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "hellaswag": 1.0
+ },
+ "n-shot": {
+ "hellaswag": 0
+ },
+ "higher_is_better": {
+ "hellaswag": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "hellaswag": {
+ "original": 10042,
+ "effective": 10042
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,data_parallel_size=8,gpu_memory_utilization=0.8,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737962245.449226,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "hellaswag": "edcc7edd27a555d3f7cbca0641152b2c5e4eb6eb79c5e62d7fe5887f47814323"
+ },
+ "model_source": "vllm",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 991385.417049995,
+ "end_time": 991536.278556097,
+ "total_evaluation_time_seconds": "150.86150610190816"
+}
\ No newline at end of file
diff --git a/evaluations/en/Meta-Llama-3.1-8B-Instruct/hendrycks_ethics_0_shot.json b/evaluations/en/Meta-Llama-3.1-8B-Instruct/hendrycks_ethics_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..e22630ce84bbbbe3890ac8a800ef087f542bbca9
--- /dev/null
+++ b/evaluations/en/Meta-Llama-3.1-8B-Instruct/hendrycks_ethics_0_shot.json
@@ -0,0 +1,313 @@
+{
+ "results": {
+ "ethics_cm": {
+ "alias": "ethics_cm",
+ "acc,none": 0.6028314028314028,
+ "acc_stderr,none": 0.007851375973914774
+ },
+ "ethics_deontology": {
+ "alias": "ethics_deontology",
+ "acc,none": 0.6362625139043382,
+ "acc_stderr,none": 0.00802347957953013
+ },
+ "ethics_justice": {
+ "alias": "ethics_justice",
+ "acc,none": 0.6830621301775148,
+ "acc_stderr,none": 0.008949404717643246
+ },
+ "ethics_utilitarianism": {
+ "alias": "ethics_utilitarianism",
+ "acc,none": 0.552828618968386,
+ "acc_stderr,none": 0.007171255536806875
+ },
+ "ethics_virtue": {
+ "alias": "ethics_virtue",
+ "acc,none": 0.8592964824120602,
+ "acc_stderr,none": 0.0049302745463304706
+ }
+ },
+ "group_subtasks": {
+ "ethics_utilitarianism": [],
+ "ethics_deontology": [],
+ "ethics_virtue": [],
+ "ethics_justice": [],
+ "ethics_cm": []
+ },
+ "configs": {
+ "ethics_cm": {
+ "task": "ethics_cm",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "commonsense",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "{{input}}\nQuestion: Is this wrong?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_deontology": {
+ "task": "ethics_deontology",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "deontology",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}} {{excuse.rstrip()}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_justice": {
+ "task": "ethics_justice",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "justice",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}}\"\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "unreasonable",
+ "reasonable"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_utilitarianism": {
+ "task": "ethics_utilitarianism",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "utilitarianism",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "def doc_to_text(doc) -> str:\n doc = _preproc_doc(doc)\n return f\"Scenario 1: {doc['scenarios'][0]}\\nScenario 2: {doc['scenarios'][1]}\\nQuestion: Is Scenario 1 preferable?\\nAnswer:\"\n",
+ "doc_to_target": "def doc_to_target(doc):\n doc = _preproc_doc(doc)\n return doc[\"label\"]\n",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "ethics_virtue": {
+ "task": "ethics_virtue",
+ "tag": [
+ "hendrycks_ethics"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_ethics",
+ "dataset_name": "virtue",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "doc_to_text": "Sentence: {{scenario}}\nQuestion: Does the character in this sentence exhibit the trait \"{{trait}}\"?\nAnswer:",
+ "doc_to_target": "label",
+ "doc_to_choice": [
+ "no",
+ "yes"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc"
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "ethics_cm": 1.0,
+ "ethics_deontology": 1.0,
+ "ethics_justice": 1.0,
+ "ethics_utilitarianism": 1.0,
+ "ethics_virtue": 1.0
+ },
+ "n-shot": {
+ "ethics_cm": 0,
+ "ethics_deontology": 0,
+ "ethics_justice": 0,
+ "ethics_utilitarianism": 0,
+ "ethics_virtue": 0
+ },
+ "higher_is_better": {
+ "ethics_cm": {
+ "acc": true
+ },
+ "ethics_deontology": {
+ "acc": true
+ },
+ "ethics_justice": {
+ "acc": true
+ },
+ "ethics_utilitarianism": {
+ "acc": true
+ },
+ "ethics_virtue": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "ethics_cm": {
+ "original": 3885,
+ "effective": 3885
+ },
+ "ethics_justice": {
+ "original": 2704,
+ "effective": 2704
+ },
+ "ethics_virtue": {
+ "original": 4975,
+ "effective": 4975
+ },
+ "ethics_deontology": {
+ "original": 3596,
+ "effective": 3596
+ },
+ "ethics_utilitarianism": {
+ "original": 4808,
+ "effective": 4808
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,data_parallel_size=8,gpu_memory_utilization=0.8,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737961961.397722,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "ethics_cm": "088ead6c08bb523b9de2bf5098b07ad2d484b8d19d068937634e20e4a776db84",
+ "ethics_justice": "29e70305fd625a6fa42aa154ef0c4fcd7ffbfce91483485d61ef01ebaab02235",
+ "ethics_virtue": "b3e6efc9b8e5a591f9e9bd96c14a97d118c29455f4441e52d97b10b404513a55",
+ "ethics_deontology": "5311ba877c2291b107da9263731e4895484636a7fdce77b31855eb34cc6c2a37",
+ "ethics_utilitarianism": "50e3b75384c265c6c5fb9691f46a46b22a44ffb07d131e285b5f0a84b1025bc8"
+ },
+ "model_source": "vllm",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 991101.332318416,
+ "end_time": 991237.205268011,
+ "total_evaluation_time_seconds": "135.87294959498104"
+}
\ No newline at end of file
diff --git a/evaluations/en/Meta-Llama-3.1-8B-Instruct/ifeval_0_shot.json b/evaluations/en/Meta-Llama-3.1-8B-Instruct/ifeval_0_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..14bbdccdeae5e292f15d126bee606e109d36c976
--- /dev/null
+++ b/evaluations/en/Meta-Llama-3.1-8B-Instruct/ifeval_0_shot.json
@@ -0,0 +1,134 @@
+{
+ "results": {
+ "ifeval": {
+ "alias": "ifeval",
+ "prompt_level_strict_acc,none": 0.4436229205175601,
+ "prompt_level_strict_acc_stderr,none": 0.021379361149596345,
+ "inst_level_strict_acc,none": 0.5851318944844125,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.49168207024029575,
+ "prompt_level_loose_acc_stderr,none": 0.021513596564021183,
+ "inst_level_loose_acc,none": 0.6187050359712231,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ifeval": []
+ },
+ "configs": {
+ "ifeval": {
+ "task": "ifeval",
+ "dataset_path": "google/IFEval",
+ "test_split": "train",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n inp = InputExample(\n key=doc[\"key\"],\n instruction_id_list=doc[\"instruction_id_list\"],\n prompt=doc[\"prompt\"],\n kwargs=doc[\"kwargs\"],\n )\n response = results[0]\n\n out_strict = test_instruction_following_strict(inp, response)\n out_loose = test_instruction_following_loose(inp, response)\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ifeval": 4.0
+ },
+ "n-shot": {
+ "ifeval": 0
+ },
+ "higher_is_better": {
+ "ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ifeval": {
+ "original": 541,
+ "effective": 541
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,data_parallel_size=8,gpu_memory_utilization=0.8,download_dir=/tmp,enforce_eager=True",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737968143.925328,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_eos_token": [
+ "<|eot_id|>",
+ "128009"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128009,
+ "max_length": 131072,
+ "task_hashes": {
+ "ifeval": "a9cc24d7d92904c9f59225bb28b88b892d9ab82be222808ea7fa345ffd4500ae"
+ },
+ "model_source": "vllm",
+ "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "model_name_sanitized": "meta-llama__Meta-Llama-3.1-8B-Instruct",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 1677873.808264766,
+ "end_time": 1678076.48068606,
+ "total_evaluation_time_seconds": "202.67242129403166"
+}
\ No newline at end of file
diff --git a/evaluations/en/Meta-Llama-3.1-8B-Instruct/minerva_math_4_shot.json b/evaluations/en/Meta-Llama-3.1-8B-Instruct/minerva_math_4_shot.json
new file mode 100644
index 0000000000000000000000000000000000000000..ed7650769a902ba1446f8751ff657f922631f906
--- /dev/null
+++ b/evaluations/en/Meta-Llama-3.1-8B-Instruct/minerva_math_4_shot.json
@@ -0,0 +1,529 @@
+{
+ "results": {
+ "minerva_math": {
+ "exact_match,none": 0.3426,
+ "exact_match_stderr,none": 0.00626883548076138,
+ "alias": "minerva_math"
+ },
+ "minerva_math_algebra": {
+ "alias": " - minerva_math_algebra",
+ "exact_match,none": 0.4928390901432182,
+ "exact_match_stderr,none": 0.014517208529270137
+ },
+ "minerva_math_counting_and_prob": {
+ "alias": " - minerva_math_counting_and_prob",
+ "exact_match,none": 0.3059071729957806,
+ "exact_match_stderr,none": 0.021187174233958342
+ },
+ "minerva_math_geometry": {
+ "alias": " - minerva_math_geometry",
+ "exact_match,none": 0.27348643006263046,
+ "exact_match_stderr,none": 0.02038805554382814
+ },
+ "minerva_math_intermediate_algebra": {
+ "alias": " - minerva_math_intermediate_algebra",
+ "exact_match,none": 0.1362126245847176,
+ "exact_match_stderr,none": 0.011421123769972273
+ },
+ "minerva_math_num_theory": {
+ "alias": " - minerva_math_num_theory",
+ "exact_match,none": 0.23703703703703705,
+ "exact_match_stderr,none": 0.01831746837581445
+ },
+ "minerva_math_prealgebra": {
+ "alias": " - minerva_math_prealgebra",
+ "exact_match,none": 0.5889781859931114,
+ "exact_match_stderr,none": 0.016681012759620913
+ },
+ "minerva_math_precalc": {
+ "alias": " - minerva_math_precalc",
+ "exact_match,none": 0.16117216117216118,
+ "exact_match_stderr,none": 0.015750095129187364
+ }
+ },
+ "groups": {
+ "minerva_math": {
+ "exact_match,none": 0.3426,
+ "exact_match_stderr,none": 0.00626883548076138,
+ "alias": "minerva_math"
+ }
+ },
+ "group_subtasks": {
+ "minerva_math": [
+ "minerva_math_algebra",
+ "minerva_math_counting_and_prob",
+ "minerva_math_geometry",
+ "minerva_math_intermediate_algebra",
+ "minerva_math_num_theory",
+ "minerva_math_prealgebra",
+ "minerva_math_precalc"
+ ]
+ },
+ "configs": {
+ "minerva_math_algebra": {
+ "task": "minerva_math_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_counting_and_prob": {
+ "task": "minerva_math_counting_and_prob",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "counting_and_probability",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_geometry": {
+ "task": "minerva_math_geometry",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_intermediate_algebra": {
+ "task": "minerva_math_intermediate_algebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "intermediate_algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_num_theory": {
+ "task": "minerva_math_num_theory",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "number_theory",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": ""
+ },
+ "num_fewshot": 4,
+ "metric_list": [
+ {
+ "metric": "exact_match",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [
+ "Problem:"
+ ],
+ "do_sample": false,
+ "temperature": 0.0
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "minerva_math_prealgebra": {
+ "task": "minerva_math_prealgebra",
+ "tag": [
+ "math_word_problems"
+ ],
+ "group": [
+ "math_word_problems"
+ ],
+ "dataset_path": "EleutherAI/hendrycks_math",
+ "dataset_name": "prealgebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "training_split": "train",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_doc(doc: dict) -> dict:\n out_doc = {\n \"problem\": doc[\"problem\"],\n \"solution\": doc[\"solution\"],\n \"answer\": normalize_final_answer(\n remove_boxed(last_boxed_only_string(doc[\"solution\"]))\n ),\n }\n if getattr(doc, \"few_shot\", None) is not None:\n out_doc[\"few_shot\"] = True\n return out_doc\n\n return dataset.map(_process_doc)\n",
+ "doc_to_text": "def doc_to_text(doc: dict) -> str:\n return \"Problem:\" + \"\\n\" + doc[\"problem\"] + \"\\n\\n\" + \"Solution:\"\n",
+ "doc_to_target": "{{answer if few_shot is undefined else solution}}",
+ "process_results": "def process_results(doc: dict, results: List[str]) -> Dict[str, int]:\n candidates = results[0]\n\n unnormalized_answer = get_unnormalized_answer(candidates)\n answer = normalize_final_answer(unnormalized_answer)\n\n if is_equiv(answer, doc[\"answer\"]):\n retval = 1\n else:\n retval = 0\n\n results = {\n \"exact_match\": retval,\n }\n return results\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "first_n",
+ "samples": "